Commit 10f294ff authored by yuguo-Jack
llama_paddle
parent 7c64e6ec
# UIE Slim data distillation
While UIE has powerful zero-shot extraction capabilities, its prompt-based structure makes real-time serving compute-intensive. Some industrial application scenarios have strict inference performance requirements, and the model cannot go into production without effective compression. UIE Slim Data Distillation is built on knowledge distillation: it uses data as a bridge to transfer the knowledge of the UIE model to a smaller closed-domain information extraction model, achieving a significant inference speedup with minimal loss in accuracy.
#### Three steps of UIE data distillation
- **Step 1**: Finetune the UIE model on the labeled data to get the Teacher Model.
- **Step 2**: Process the user-provided unlabeled data and run inference with Taskflow UIE.
- **Step 3**: Use the labeled data and the inference results obtained in step 2 to train a closed-domain Student Model.
## UIE Finetune
Refer to [UIE relationship extraction fine-tuning](../README.md) to complete the model fine-tuning and get ``../checkpoint/model_best``.
## Offline Distillation
#### Predict labels for unsupervised data with the trained UIE custom model
```shell
python data_distill.py \
--data_path ../data \
--save_dir student_data \
--task_type relation_extraction \
--synthetic_ratio 10 \
--model_path ../checkpoint/model_best
```
**NOTE**: The schema must be configured in `data_distill.py` according to the labeled data, and it must contain all label types present in that data.
Description of configurable parameters:
- `data_path`: path to the labeled data (`doccano_ext.json`) and the unsupervised text (`unlabeled_data.txt`).
- `model_path`: path of the trained UIE custom model.
- `save_dir`: directory where the student model's training data is saved.
- `synthetic_ratio`: controls the ratio of synthetic data; the maximum number of synthetic samples is `synthetic_ratio` × the number of labeled samples.
- `platform`: annotation platform used to label the data; options are `doccano` and `label_studio`, default `label_studio`.
- `task_type`: task type; options are `entity_extraction`, `relation_extraction`, `event_extraction`, and `opinion_extraction`. Because this is closed-domain extraction and the post-processing logic differs per task, the task type must be specified.
- `seed`: random seed, default 1000.
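The schema mentioned in the note above is defined directly inside `data_distill.py`; for the relation-extraction example used throughout this project it looks like the following (the Chinese keys are the label names from the annotated data: weapon name, with relations country of origin, type, and R&D unit):

```python
# Closed-domain schema: each subject label maps to its relation (predicate)
# names. It must cover every label type appearing in the labeled data.
schema = {"武器名称": ["产国", "类型", "研发单位"]}
```

Edit this dictionary to match your own label set before running `data_distill.py`.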
#### Teacher model evaluation
In the UIE fine-tuning stage, model performance is evaluated on data in the UIE training format, which is not a standard end-to-end evaluation for relation extraction or event extraction. End-to-end evaluation can be performed with the following script.
```shell
python evaluate_teacher.py \
--task_type relation_extraction \
--test_path ./student_data/dev_data.json \
--label_maps_path ./student_data/label_maps.json \
--model_path ../checkpoint/model_best
```
Description of configurable parameters:
- `model_path`: path of the trained UIE custom model.
- `test_path`: path of the test set.
- `label_maps_path`: dictionary of student model labels.
- `batch_size`: batch size, default 8.
- `max_seq_len`: maximum text length, default 256.
- `task_type`: task type; options are `entity_extraction`, `relation_extraction`, `event_extraction`, and `opinion_extraction`. Because this evaluates closed-domain information extraction, the task type must be specified.
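The end-to-end scores reported here are exact-match precision/recall/F1 over predicted spans, computed by `metric.py` in this project. A simplified sketch of that span-set comparison (the helper name is illustrative, not the project API):

```python
def span_f1(pred_spans, gold_spans):
    """Exact-match precision/recall/F1 over two collections of spans,
    e.g. (start, end, label) tuples. Illustrative only."""
    pred, gold = set(pred_spans), set(gold_spans)
    tp = len(pred & gold)  # spans predicted exactly right
    precision = tp / len(pred) if pred else 0.0
    recall = tp / len(gold) if gold else 0.0
    # F1 = 2*TP / (|pred| + |gold|), equivalent to the harmonic mean of P and R
    f1 = 2 * tp / (len(pred) + len(gold)) if (pred or gold) else 0.0
    return precision, recall, f1
```

A prediction only counts if the start offset, end offset, and label all match the gold annotation, which is why this evaluation is stricter than the UIE training-format evaluation above.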
#### Student model training
```shell
python train.py \
--task_type relation_extraction \
--train_path student_data/train_data.json \
--dev_path student_data/dev_data.json \
--label_maps_path student_data/label_maps.json \
--num_epochs 50 \
--encoder ernie-3.0-mini-zh
```
Description of configurable parameters:
- `train_path`: training set file path.
- `dev_path`: validation set file path.
- `batch_size`: batch size, default 16.
- `learning_rate`: learning rate, default 3e-5.
- `save_dir`: model storage path, default `./checkpoint`.
- `max_seq_len`: maximum text length, default 256.
- `weight_decay`: weight decay coefficient used by the AdamW optimizer.
- `warmup_proportion`: proportion of training used for learning-rate warmup. If set to 0.1, the learning rate increases linearly from 0 to `learning_rate` over the first 10% of training steps, then decays linearly. Default 0.0.
- `num_epochs`: number of training epochs, default 100.
- `seed`: random seed, default 1000.
- `encoder`: backbone model of the student model, default `ernie-3.0-mini-zh`.
- `task_type`: task type; options are `entity_extraction`, `relation_extraction`, `event_extraction`, and `opinion_extraction`. Because this is closed-domain information extraction, the task type must be specified.
- `logging_steps`: interval (in steps) between log prints, default 10.
- `eval_steps`: interval (in steps) between evaluations, default 200.
- `device`: device to train on; options are `cpu` and `gpu`.
- `init_from_ckpt`: optional; model parameter path for warm-starting training. Default None.
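The warmup-then-linear-decay schedule described for `warmup_proportion` can be sketched as follows; this is a simplified stand-in for PaddleNLP's `LinearDecayWithWarmup` used by `train.py`, for intuition only:

```python
def lr_at_step(step, total_steps, learning_rate=3e-5, warmup_proportion=0.1):
    """Learning rate at a given step: linear warmup from 0 over the first
    warmup_proportion of training, then linear decay back to 0."""
    warmup_steps = int(total_steps * warmup_proportion)
    if warmup_steps > 0 and step < warmup_steps:
        # Warmup phase: ramp up proportionally to progress through warmup
        return learning_rate * step / warmup_steps
    # Decay phase: fall linearly to 0 at the final step
    return learning_rate * max(0.0, (total_steps - step) / max(1, total_steps - warmup_steps))
```

With `warmup_proportion=0.1` and 1000 total steps, the rate peaks at `learning_rate` at step 100 and reaches 0 at step 1000.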
#### Student model evaluation
```shell
python evaluate.py \
--model_path ./checkpoint/model_best \
--test_path student_data/dev_data.json \
--task_type relation_extraction \
--label_maps_path student_data/label_maps.json \
--encoder ernie-3.0-mini-zh
```
Description of configurable parameters:
- `model_path`: path of the trained student model.
- `test_path`: path of the test set.
- `label_maps_path`: dictionary of student model labels.
- `batch_size`: batch size, default 8.
- `max_seq_len`: maximum text length, default 256.
- `encoder`: backbone model of the student model, default `ernie-3.0-mini-zh`.
- `task_type`: task type; options are `entity_extraction`, `relation_extraction`, `event_extraction`, and `opinion_extraction`. Because this evaluates closed-domain information extraction, the task type must be specified.
## Student model deployment
- Quickly deploy the closed-domain information extraction model through Taskflow; `task_path` is the path of the student model.
```python
>>> from pprint import pprint
>>> from paddlenlp import Taskflow
>>> my_ie = Taskflow("information_extraction", model="uie-data-distill-gp", task_path="checkpoint/model_best/") # Schema is fixed in closed-domain information extraction
>>> pprint(my_ie("The Virgo deceleration bomb was developed by the Swedish FFV Ordnance Company specifically for low-altitude, high-speed bombing by attack aircraft of the Royal Swedish Air Force. Development began in 1956 and it entered service in 1963. It is equipped on the A32 'Contradiction', A35 'Dragon', and AJ134 'Thunder' attack aircraft, and is mainly used to attack landing craft, parked aircraft, anti-aircraft artillery, field artillery, light armored vehicles, and active forces."))
[{'weapon name': [{'end': 14,
                   'probability': 0.9976037,
                   'relations': {'country of origin': [{'end': 18,
                                                        'probability': 0.9988706,
                                                        'relations': {},
                                                        'start': 16,
                                                        'text': 'Sweden'}],
                                 'R&D unit': [{'end': 25,
                                               'probability': 0.9978277,
                                               'relations': {},
                                               'start': 18,
                                               'text': 'FFV Ordnance Company'}],
                                 'type': [{'end': 14,
                                           'probability': 0.99837446,
                                           'relations': {},
                                           'start': 12,
                                           'text': 'bomb'}]},
                   'start': 0,
                   'text': 'Virgo deceleration bomb'}]}]
```
# References
- **[GlobalPointer](https://kexue.fm/search/globalpointer/)**
- **[GPLinker](https://kexue.fm/archives/8888)**
- **[JunnYu/GPLinker_pytorch](https://github.com/JunnYu/GPLinker_pytorch)**
# coding=utf-8
# Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
import numpy as np
import paddle
import paddle.nn as nn
class Criterion(nn.Layer):
    """Criterion for GPNet"""

    def __init__(self, mask_zero=True):
        super().__init__()
        self.mask_zero = mask_zero

    def _sparse_multilabel_categorical_crossentropy(self, y_true, y_pred, mask_zero=False):
        """Sparse multi-label categorical cross entropy,
        see "https://kexue.fm/archives/7359".
        """
        zeros = paddle.zeros_like(y_pred[..., :1])
        y_pred = paddle.concat([y_pred, zeros], axis=-1)
        if mask_zero:
            infs = zeros + 1e12
            y_pred = paddle.concat([infs, y_pred[..., 1:]], axis=-1)
        y_pos_2 = paddle.take_along_axis(y_pred, y_true, axis=-1)
        y_pos_1 = paddle.concat([y_pos_2, zeros], axis=-1)
        if mask_zero:
            y_pred = paddle.concat([-infs, y_pred[..., 1:]], axis=-1)
            y_pos_2 = paddle.take_along_axis(y_pred, y_true, axis=-1)
        pos_loss = (-y_pos_1).exp().sum(axis=-1).log()
        all_loss = y_pred.exp().sum(axis=-1).log()
        aux_loss = y_pos_2.exp().sum(axis=-1).log() - all_loss
        aux_loss = paddle.clip(1 - paddle.exp(aux_loss), min=0.1, max=1)
        neg_loss = all_loss + paddle.log(aux_loss)
        return pos_loss + neg_loss

    def __call__(self, y_pred, y_true):
        shape = y_pred.shape
        y_true = y_true[..., 0] * shape[2] + y_true[..., 1]
        # Reshape to [batch_size, num_classes, seq_len * seq_len]
        y_pred = paddle.reshape(y_pred, shape=[shape[0], -1, np.prod(shape[2:])])
        loss = self._sparse_multilabel_categorical_crossentropy(y_true, y_pred, self.mask_zero)
        return loss.sum(axis=1).mean()
# coding=utf-8
# Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
from dataclasses import dataclass
from typing import Dict, List, Optional, Union
import paddle
from paddlenlp.transformers.tokenizer_utils_base import (
    PaddingStrategy,
    PretrainedTokenizerBase,
)
ignore_list = ["offset_mapping", "text"]
@dataclass
class DataCollator:
    tokenizer: PretrainedTokenizerBase
    padding: Union[bool, str, PaddingStrategy] = True
    max_length: Optional[int] = None
    label_maps: Optional[dict] = None
    task_type: Optional[str] = None

    def __call__(self, features: List[Dict[str, Union[List[int], paddle.Tensor]]]) -> Dict[str, paddle.Tensor]:
        labels = [feature["labels"] for feature in features] if "labels" in features[0].keys() else None
        new_features = [{k: v for k, v in f.items() if k not in ["labels"] + ignore_list} for f in features]
        batch = self.tokenizer.pad(
            new_features,
            padding=self.padding,
        )
        batch = [paddle.to_tensor(batch[k]) for k in batch.keys()]
        if labels is None:  # for test
            if "offset_mapping" in features[0].keys():
                batch.append([feature["offset_mapping"] for feature in features])
            if "text" in features[0].keys():
                batch.append([feature["text"] for feature in features])
            return batch
        bs = batch[0].shape[0]
        if self.task_type == "entity_extraction":
            # Ensure the dimension is greater than or equal to 1
            max_ent_num = max(max([len(lb["ent_labels"]) for lb in labels]), 1)
            num_ents = len(self.label_maps["entity2id"])
            batch_entity_labels = paddle.zeros(shape=[bs, num_ents, max_ent_num, 2], dtype="int64")
            for i, lb in enumerate(labels):
                for eidx, (l, eh, et) in enumerate(lb["ent_labels"]):
                    batch_entity_labels[i, l, eidx, :] = paddle.to_tensor([eh, et])
            batch.append([batch_entity_labels])
        else:
            # Ensure the dimension is greater than or equal to 1
            max_ent_num = max(max([len(lb["ent_labels"]) for lb in labels]), 1)
            max_spo_num = max(max([len(lb["rel_labels"]) for lb in labels]), 1)
            num_ents = len(self.label_maps["entity2id"])
            if "relation2id" in self.label_maps.keys():
                num_rels = len(self.label_maps["relation2id"])
            else:
                num_rels = len(self.label_maps["sentiment2id"])
            batch_entity_labels = paddle.zeros(shape=[bs, num_ents, max_ent_num, 2], dtype="int64")
            batch_head_labels = paddle.zeros(shape=[bs, num_rels, max_spo_num, 2], dtype="int64")
            batch_tail_labels = paddle.zeros(shape=[bs, num_rels, max_spo_num, 2], dtype="int64")
            for i, lb in enumerate(labels):
                for eidx, (l, eh, et) in enumerate(lb["ent_labels"]):
                    batch_entity_labels[i, l, eidx, :] = paddle.to_tensor([eh, et])
                for spidx, (sh, st, p, oh, ot) in enumerate(lb["rel_labels"]):
                    batch_head_labels[i, p, spidx, :] = paddle.to_tensor([sh, oh])
                    batch_tail_labels[i, p, spidx, :] = paddle.to_tensor([st, ot])
            batch.append([batch_entity_labels, batch_head_labels, batch_tail_labels])
        return batch
# coding=utf-8
# Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
import argparse
import json
import math
import os
import random
from tqdm import tqdm
from utils import anno2distill, schema2label_maps, set_seed, synthetic2distill
from paddlenlp import Taskflow
from paddlenlp.utils.log import logger
def do_data_distill():
    set_seed(args.seed)
    # Generate closed-domain label maps
    if not os.path.exists(args.save_dir):
        os.mkdir(args.save_dir)
    label_maps = schema2label_maps(args.task_type, schema=args.schema)
    label_maps_path = os.path.join(args.save_dir, "label_maps.json")
    # Save closed-domain label maps file
    with open(label_maps_path, "w", encoding="utf-8") as fp:
        fp.write(json.dumps(label_maps, ensure_ascii=False))
    # Load annotation file and convert to distill format
    sample_index = json.loads(
        open(os.path.join(args.data_path, "sample_index.json"), "r", encoding="utf-8").readline()
    )
    train_ids = sample_index["train_ids"]
    dev_ids = sample_index["dev_ids"]
    test_ids = sample_index["test_ids"]
    if args.platform == "label_studio":
        with open(os.path.join(args.data_path, "label_studio.json"), "r", encoding="utf-8") as fp:
            json_lines = json.loads(fp.read())
    elif args.platform == "doccano":
        json_lines = []
        with open(os.path.join(args.data_path, "doccano_ext.json"), "r", encoding="utf-8") as fp:
            for line in fp:
                json_lines.append(json.loads(line))
    else:
        raise ValueError("Unsupported annotation platform!")
    train_lines = [json_lines[i] for i in train_ids]
    train_lines = anno2distill(train_lines, args.task_type, label_maps, args.platform)
    dev_lines = [json_lines[i] for i in dev_ids]
    dev_lines = anno2distill(dev_lines, args.task_type, label_maps, args.platform)
    test_lines = [json_lines[i] for i in test_ids]
    test_lines = anno2distill(test_lines, args.task_type, label_maps, args.platform)
    # Load trained UIE model
    uie = Taskflow("information_extraction", schema=args.schema, task_path=args.model_path)
    if args.synthetic_ratio > 0:
        # Generate synthetic data
        texts = open(os.path.join(args.data_path, "unlabeled_data.txt"), "r", encoding="utf-8").readlines()
        actual_ratio = math.ceil(len(texts) / len(train_lines))
        if actual_ratio <= args.synthetic_ratio or args.synthetic_ratio == -1:
            infer_texts = texts
        else:
            idxs = random.sample(range(0, len(texts)), args.synthetic_ratio * len(train_lines))
            infer_texts = [texts[i] for i in idxs]
        infer_results = []
        for text in tqdm(infer_texts, desc="Predicting: ", leave=False):
            infer_results.extend(uie(text))
        train_synthetic_lines = synthetic2distill(infer_texts, infer_results, args.task_type)
        # Concat original and synthetic data
        train_lines.extend(train_synthetic_lines)

    def _save_examples(save_dir, file_name, examples):
        count = 0
        save_path = os.path.join(save_dir, file_name)
        with open(save_path, "w", encoding="utf-8") as f:
            for example in examples:
                f.write(json.dumps(example, ensure_ascii=False) + "\n")
                count += 1
        logger.info("Save %d examples to %s." % (count, save_path))

    _save_examples(args.save_dir, "train_data.json", train_lines)
    _save_examples(args.save_dir, "dev_data.json", dev_lines)
    _save_examples(args.save_dir, "test_data.json", test_lines)


if __name__ == "__main__":
    # yapf: disable
    parser = argparse.ArgumentParser()
    parser.add_argument("--data_path", default="../data", type=str, help="The directory for labeled data with doccano format and the large scale unlabeled data.")
    parser.add_argument("--model_path", type=str, default="../checkpoint/model_best", help="The path of saved model that you want to load.")
    parser.add_argument("--save_dir", default="./distill_task", type=str, help="The path of data that you wanna save.")
    parser.add_argument("--synthetic_ratio", default=10, type=int, help="The ratio of labeled and synthetic samples.")
    parser.add_argument("--task_type", choices=['relation_extraction', 'event_extraction', 'entity_extraction', 'opinion_extraction'], default="entity_extraction", type=str, help="Select the training task type.")
    parser.add_argument("--seed", type=int, default=1000, help="Random seed for initialization")
    parser.add_argument("--platform", choices=['doccano', 'label_studio'], type=str, default="label_studio", help="Select the annotation platform.")
    args = parser.parse_args()
    # yapf: enable

    # Define your schema here
    schema = {"武器名称": ["产国", "类型", "研发单位"]}
    args.schema = schema

    do_data_distill()
# Service deployment based on PaddleNLP SimpleServing
## Contents
- [Environment Preparation](#environment-preparation)
- [Starting the Server](#starting-the-server)
- [Starting the Client](#starting-the-client)
- [Service Custom Parameters](#service-custom-parameters)
## Environment Preparation
Install a PaddleNLP version with SimpleServing support (or the latest develop version):
```shell
pip install "paddlenlp>=2.4.4"
```
## Starting the Server
```bash
paddlenlp server server:app --workers 1 --host 0.0.0.0 --port 8189
```
## Starting the Client
```bash
python client.py
```
## Service Custom Parameters
### Server Custom Parameters
#### Schema replacement
```python
# Default schema
schema = {"武器名称": ["产国", "类型", "研发单位"]}
```
#### Setting the model path
```python
# Default task_path
uie = Taskflow('information_extraction', model='uie-data-distill-gp', task_path='../../checkpoint/model_best/', schema=schema)
```
#### Multi-GPU service prediction
PaddleNLP SimpleServing supports multi-GPU load-balanced prediction: simply register two Taskflow tasks when registering the service. Sample code:
```python
uie1 = Taskflow('information_extraction', model='uie-data-distill-gp', task_path='../../checkpoint/model_best/', schema=schema, device_id=0)
uie2 = Taskflow('information_extraction', model='uie-data-distill-gp', task_path='../../checkpoint/model_best/', schema=schema, device_id=1)
service.register_taskflow('uie', [uie1, uie2])
```
### Client Custom Parameters
```python
# Change to the input texts you want
texts = ['威尔哥(Virgo)减速炸弹是由瑞典FFV军械公司专门为瑞典皇家空军的攻击机实施低空高速轰炸而研制,1956年开始研制,1963年进入服役,装备于A32“矛盾”、A35“龙”、和AJ134“雷”攻击机,主要用于攻击登陆艇、停放的飞机、高炮、野战火炮、轻型防护装甲车辆以及有生力量。']
```
# Service deployment based on PaddleNLP SimpleServing
- [Environment Preparation](#1)
- [Server](#2)
- [Client](#3)
- [Service Custom Parameters](#4)
<a name="1"></a>
## Environment Preparation
Install a PaddleNLP version with SimpleServing support (or the latest develop version):
```shell
pip install "paddlenlp>=2.4.4"
```
<a name="2"></a>
## Server
```bash
paddlenlp server server:app --workers 1 --host 0.0.0.0 --port 8189
```
<a name="3"></a>
## Client
```bash
python client.py
```
<a name="4"></a>
## Service Custom Parameters
### Server Custom Parameters
#### schema replacement
```python
# Default schema
schema = {"Weapon Name": ["Country of Production", "Type", "R&D Unit"]}
```
#### Set model path
```python
# Default task_path
uie = Taskflow('information_extraction', model='uie-data-distill-gp', task_path='../../checkpoint/model_best/', schema=schema)
```
#### Multi-GPU service prediction
PaddleNLP SimpleServing supports multi-GPU load-balanced prediction: simply register two Taskflow tasks when registering the service. Sample code:
```python
uie1 = Taskflow('information_extraction', task_path='../../checkpoint/model_best/', schema=schema, device_id=0)
uie2 = Taskflow('information_extraction', task_path='../../checkpoint/model_best/', schema=schema, device_id=1)
service.register_taskflow('uie', [uie1, uie2])
```
### Client Custom Parameters
```python
# Change to the input texts you want
texts = ['威尔哥(Virgo)减速炸弹是由瑞典FFV军械公司专门为瑞典皇家空军的攻击机实施低空高速轰炸而研制,1956年开始研制,1963年进入服役,装备于A32“矛盾”、A35“龙”、和AJ134“雷”攻击机,主要用于攻击登陆艇、停放的飞机、高炮、野战火炮、轻型防护装甲车辆以及有生力量。']
```
# Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
import json
import requests
url = "http://0.0.0.0:8189/taskflow/uie"
headers = {"Content-Type": "application/json"}
texts = [
    "威尔哥(Virgo)减速炸弹是由瑞典FFV军械公司专门为瑞典皇家空军的攻击机实施低空高速轰炸而研制,1956年开始研制,1963年进入服役,装备于A32“矛盾”、A35“龙”、和AJ134“雷”攻击机,主要用于攻击登陆艇、停放的飞机、高炮、野战火炮、轻型防护装甲车辆以及有生力量。"
]
data = {"data": {"text": texts}}
r = requests.post(url=url, headers=headers, data=json.dumps(data))
datas = json.loads(r.text)
print(datas)
# Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
from paddlenlp import SimpleServer, Taskflow
# The schema changed to your defined schema
schema = {"武器名称": ["产国", "类型", "研发单位"]}
# The task path changed to your best model path
uie = Taskflow(
    "information_extraction", model="uie-data-distill-gp", schema=schema, task_path="../../checkpoint/model_best/"
)
# If you want to define the finetuned uie service
app = SimpleServer()
app.register_taskflow("taskflow/uie", uie)
# coding=utf-8
# Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
import argparse
import os
import paddle
from metric import get_eval
from tqdm import tqdm
from utils import create_dataloader, get_label_maps, postprocess, reader
from paddlenlp.datasets import load_dataset
from paddlenlp.layers import (
    GlobalPointerForEntityExtraction,
    GPLinkerForRelationExtraction,
)
from paddlenlp.transformers import AutoModel, AutoTokenizer
from paddlenlp.utils.log import logger
@paddle.no_grad()
def evaluate(model, dataloader, label_maps, task_type="relation_extraction"):
    model.eval()
    all_preds = ([], []) if task_type in ["opinion_extraction", "relation_extraction", "event_extraction"] else []
    for batch in tqdm(dataloader, desc="Evaluating: ", leave=False):
        input_ids, attention_masks, offset_mappings, texts = batch
        logits = model(input_ids, attention_masks)
        batch_outputs = postprocess(logits, offset_mappings, texts, label_maps, task_type)
        if isinstance(batch_outputs, tuple):
            all_preds[0].extend(batch_outputs[0])  # Entity output
            all_preds[1].extend(batch_outputs[1])  # Relation output
        else:
            all_preds.extend(batch_outputs)
    eval_results = get_eval(all_preds, dataloader.dataset.raw_data, task_type)
    model.train()
    return eval_results


def do_eval():
    label_maps = get_label_maps(args.task_type, args.label_maps_path)
    tokenizer = AutoTokenizer.from_pretrained(args.encoder)
    encoder = AutoModel.from_pretrained(args.encoder)
    if args.task_type == "entity_extraction":
        model = GlobalPointerForEntityExtraction(encoder, label_maps)
    else:
        model = GPLinkerForRelationExtraction(encoder, label_maps)
    if args.model_path:
        state_dict = paddle.load(os.path.join(args.model_path, "model_state.pdparams"))
        model.set_dict(state_dict)
    test_ds = load_dataset(reader, data_path=args.test_path, lazy=False)
    test_dataloader = create_dataloader(
        test_ds,
        tokenizer,
        max_seq_len=args.max_seq_len,
        batch_size=args.batch_size,
        label_maps=label_maps,
        mode="test",
        task_type=args.task_type,
    )
    eval_result = evaluate(model, test_dataloader, label_maps, task_type=args.task_type)
    logger.info("Evaluation precision: " + str(eval_result))


if __name__ == "__main__":
    # yapf: disable
    parser = argparse.ArgumentParser()
    parser.add_argument("--model_path", type=str, default=None, help="The path of saved model that you want to load.")
    parser.add_argument("--test_path", type=str, default=None, help="The path of test set.")
    parser.add_argument("--encoder", default="ernie-3.0-mini-zh", type=str, help="Select the pretrained encoder model for GP.")
    parser.add_argument("--label_maps_path", default="./ner_data/label_maps.json", type=str, help="The file path of the labels dictionary.")
    parser.add_argument("--batch_size", type=int, default=16, help="Batch size per GPU/CPU for training.")
    parser.add_argument("--max_seq_len", type=int, default=128, help="The maximum total input sequence length after tokenization.")
    parser.add_argument("--task_type", choices=['relation_extraction', 'event_extraction', 'entity_extraction', 'opinion_extraction'], default="entity_extraction", type=str, help="Select the training task type.")
    args = parser.parse_args()
    # yapf: enable

    do_eval()
# coding=utf-8
# Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
import argparse
import paddle
from metric import get_eval
from tqdm import tqdm
from utils import create_dataloader, get_label_maps, reader, synthetic2distill
from paddlenlp import Taskflow
from paddlenlp.datasets import load_dataset
from paddlenlp.transformers import AutoTokenizer
from paddlenlp.utils.log import logger
@paddle.no_grad()
def evaluate(uie, dataloader, task_type="relation_extraction"):
    all_preds = ([], []) if task_type in ["opinion_extraction", "relation_extraction", "event_extraction"] else []
    infer_results = []
    all_texts = []
    for batch in tqdm(dataloader, desc="Evaluating: ", leave=False):
        _, _, _, texts = batch
        all_texts.extend(texts)
        infer_results.extend(uie(texts))
    infer_results = synthetic2distill(all_texts, infer_results, task_type)
    for res in infer_results:
        if task_type == "entity_extraction":
            all_preds.append(res["entity_list"])
        else:
            all_preds[0].append(res["entity_list"])
            all_preds[1].append(res["spo_list"])
    eval_results = get_eval(all_preds, dataloader.dataset.raw_data, task_type)
    return eval_results


def do_eval():
    # Load trained UIE model
    uie = Taskflow("information_extraction", schema=args.schema, batch_size=args.batch_size, task_path=args.model_path)
    label_maps = get_label_maps(args.task_type, args.label_maps_path)
    tokenizer = AutoTokenizer.from_pretrained("ernie-3.0-base-zh")
    test_ds = load_dataset(reader, data_path=args.test_path, lazy=False)
    test_dataloader = create_dataloader(
        test_ds,
        tokenizer,
        max_seq_len=args.max_seq_len,
        batch_size=args.batch_size,
        label_maps=label_maps,
        mode="test",
        task_type=args.task_type,
    )
    eval_result = evaluate(uie, test_dataloader, task_type=args.task_type)
    logger.info("Evaluation precision: " + str(eval_result))


if __name__ == "__main__":
    # yapf: disable
    parser = argparse.ArgumentParser()
    parser.add_argument("--model_path", type=str, default=None, help="The path of saved model that you want to load.")
    parser.add_argument("--test_path", type=str, default=None, help="The path of test set.")
    parser.add_argument("--label_maps_path", default="./ner_data/label_maps.json", type=str, help="The file path of the labels dictionary.")
    parser.add_argument("--batch_size", type=int, default=8, help="Batch size per GPU/CPU for training.")
    parser.add_argument("--max_seq_len", type=int, default=256, help="The maximum total input sequence length after tokenization.")
    parser.add_argument("--task_type", choices=['relation_extraction', 'event_extraction', 'entity_extraction', 'opinion_extraction'], default="entity_extraction", type=str, help="Select the training task type.")
    args = parser.parse_args()
    # yapf: enable

    schema = {"武器名称": ["产国", "类型", "研发单位"]}
    args.schema = schema

    do_eval()
# coding=utf-8
# Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
def get_eval(all_preds, raw_data, task_type):
if task_type == "entity_extraction":
ex, ey, ez = 1e-10, 1e-10, 1e-10
for ent_preds, data in zip(all_preds, raw_data):
pred_ent_set = set([tuple(p.values()) for p in ent_preds])
gold_ent_set = set([tuple(g.values()) for g in data["entity_list"]])
ex += len(pred_ent_set & gold_ent_set)
ey += len(pred_ent_set)
ez += len(gold_ent_set)
ent_f1 = round(2 * ex / (ey + ez), 5) if ex != 1e-10 else 0.0
ent_precision = round(ex / ey, 5) if ey != 1e-10 else 0.0
ent_recall = round(ex / ez, 5) if ez != 1e-10 else 0.0
return {
"entity_f1": ent_f1,
"entity_precision": ent_precision,
"entity_recall": ent_recall,
}
else:
all_ent_preds, all_rel_preds = all_preds
ex, ey, ez = 1e-10, 1e-10, 1e-10
for ent_preds, data in zip(all_ent_preds, raw_data):
pred_ent_set = set([tuple(p.values()) for p in ent_preds])
gold_ent_set = set([tuple(g.values()) for g in data["entity_list"]])
ex += len(pred_ent_set & gold_ent_set)
ey += len(pred_ent_set)
ez += len(gold_ent_set)
ent_f1 = round(2 * ex / (ey + ez), 5) if ex != 1e-10 else 0.0
ent_precision = round(ex / ey, 5) if ey != 1e-10 else 0.0
ent_recall = round(ex / ez, 5) if ez != 1e-10 else 0.0
rx, ry, rz = 1e-10, 1e-10, 1e-10
for rel_preds, raw_data in zip(all_rel_preds, raw_data):
pred_rel_set = set([tuple(p.values()) for p in rel_preds])
if task_type == "opinion_extraction":
gold_rel_set = set([tuple(g.values()) for g in raw_data["aso_list"]])
else:
gold_rel_set = set([tuple(g.values()) for g in raw_data["spo_list"]])
rx += len(pred_rel_set & gold_rel_set)
ry += len(pred_rel_set)
rz += len(gold_rel_set)
rel_f1 = round(2 * rx / (ry + rz), 5) if rx != 1e-10 else 0.0
rel_precision = round(rx / ry, 5) if ry != 1e-10 else 0.0
rel_recall = round(rx / rz, 5) if rz != 1e-10 else 0.0
return {
"entity_f1": ent_f1,
"entity_precision": ent_precision,
"entity_recall": ent_recall,
"relation_f1": rel_f1,
"relation_precision": rel_precision,
"relation_recall": rel_recall,
}
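The set-intersection micro-F1 used in `get_eval` can be isolated into a small helper. A minimal sketch of the same computation; the two toy "sentences" below are fabricated examples, not data from this project:

```python
def micro_prf(pred_sets, gold_sets):
    # Micro-averaged precision/recall/F1 over per-sentence prediction sets,
    # with the same 1e-10 smoothing get_eval uses to avoid division by zero.
    x, y, z = 1e-10, 1e-10, 1e-10  # correct, predicted, gold counts
    for pred, gold in zip(pred_sets, gold_sets):
        x += len(pred & gold)
        y += len(pred)
        z += len(gold)
    f1 = round(2 * x / (y + z), 5) if x != 1e-10 else 0.0
    precision = round(x / y, 5) if y != 1e-10 else 0.0
    recall = round(x / z, 5) if z != 1e-10 else 0.0
    return precision, recall, f1

# Two toy "sentences"; entities are (text, type, start_index) tuples.
toy_preds = [{("张三", "人物", 0)}, {("北京", "地点", 5), ("李四", "人物", 0)}]
toy_gold = [{("张三", "人物", 0)}, {("北京", "地点", 5)}]
p, r, f1 = micro_prf(toy_preds, toy_gold)
```

With 2 of 3 predictions correct and all 2 gold entities found, precision is 2/3 and recall is 1.0.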
# coding=utf-8
# Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
import argparse
import os
import time
import paddle
from criterion import Criterion
from evaluate import evaluate
from utils import (
create_dataloader,
criteria_map,
get_label_maps,
reader,
save_model_config,
set_seed,
)
from paddlenlp.datasets import load_dataset
from paddlenlp.layers import (
GlobalPointerForEntityExtraction,
GPLinkerForRelationExtraction,
)
from paddlenlp.transformers import AutoModel, AutoTokenizer, LinearDecayWithWarmup
from paddlenlp.utils.log import logger
def do_train():
paddle.set_device(args.device)
rank = paddle.distributed.get_rank()
if paddle.distributed.get_world_size() > 1:
paddle.distributed.init_parallel_env()
set_seed(args.seed)
label_maps = get_label_maps(args.task_type, args.label_maps_path)
train_ds = load_dataset(reader, data_path=args.train_path, lazy=False)
dev_ds = load_dataset(reader, data_path=args.dev_path, lazy=False)
tokenizer = AutoTokenizer.from_pretrained(args.encoder)
train_dataloader = create_dataloader(
train_ds,
tokenizer,
max_seq_len=args.max_seq_len,
batch_size=args.batch_size,
label_maps=label_maps,
mode="train",
task_type=args.task_type,
)
dev_dataloader = create_dataloader(
dev_ds,
tokenizer,
max_seq_len=args.max_seq_len,
batch_size=args.batch_size,
label_maps=label_maps,
mode="dev",
task_type=args.task_type,
)
encoder = AutoModel.from_pretrained(args.encoder)
if args.task_type == "entity_extraction":
model = GlobalPointerForEntityExtraction(encoder, label_maps)
else:
model = GPLinkerForRelationExtraction(encoder, label_maps)
model_config = {"task_type": args.task_type, "label_maps": label_maps, "encoder": args.encoder}
num_training_steps = len(train_dataloader) * args.num_epochs
lr_scheduler = LinearDecayWithWarmup(args.learning_rate, num_training_steps, args.warmup_proportion)
# Generate parameter names needed to perform weight decay.
# All bias and LayerNorm parameters are excluded.
decay_params = [p.name for n, p in model.named_parameters() if not any(nd in n for nd in ["bias", "norm"])]
optimizer = paddle.optimizer.AdamW(
learning_rate=lr_scheduler,
parameters=model.parameters(),
weight_decay=args.weight_decay,
apply_decay_param_fun=lambda x: x in decay_params,
)
if args.init_from_ckpt and os.path.isfile(args.init_from_ckpt):
state_dict = paddle.load(args.init_from_ckpt)
model.set_dict(state_dict)
if paddle.distributed.get_world_size() > 1:
model = paddle.DataParallel(model)
criterion = Criterion()
global_step, best_f1 = 1, 0.0
tr_loss, logging_loss = 0.0, 0.0
tic_train = time.time()
for epoch in range(1, args.num_epochs + 1):
for batch in train_dataloader:
input_ids, attention_masks, labels = batch
logits = model(input_ids, attention_masks)
loss = sum([criterion(o, l) for o, l in zip(logits, labels)]) / 3
loss.backward()
tr_loss += loss.item()
lr_scheduler.step()
optimizer.step()
optimizer.clear_grad()
if global_step % args.logging_steps == 0 and rank == 0:
time_diff = time.time() - tic_train
loss_avg = (tr_loss - logging_loss) / args.logging_steps
logger.info(
"global step %d, epoch: %d, loss: %.5f, speed: %.2f step/s"
% (global_step, epoch, loss_avg, args.logging_steps / time_diff)
)
logging_loss = tr_loss
tic_train = time.time()
if global_step % args.eval_steps == 0 and rank == 0:
save_dir = os.path.join(args.save_dir, "model_%d" % global_step)
if not os.path.exists(save_dir):
os.makedirs(save_dir)
save_param_path = os.path.join(save_dir, "model_state.pdparams")
paddle.save(model.state_dict(), save_param_path)
save_model_config(save_dir, model_config)
logger.disable()
tokenizer.save_pretrained(save_dir)
logger.enable()
eval_result = evaluate(model, dev_dataloader, label_maps, task_type=args.task_type)
                logger.info("Evaluation results: " + str(eval_result))
f1 = eval_result[criteria_map[args.task_type]]
if f1 > best_f1:
logger.info(f"best F1 performance has been updated: {best_f1:.5f} --> {f1:.5f}")
best_f1 = f1
save_dir = os.path.join(args.save_dir, "model_best")
if not os.path.exists(save_dir):
os.makedirs(save_dir)
save_param_path = os.path.join(save_dir, "model_state.pdparams")
paddle.save(model.state_dict(), save_param_path)
save_model_config(save_dir, model_config)
logger.disable()
tokenizer.save_pretrained(save_dir)
logger.enable()
tic_train = time.time()
global_step += 1
if __name__ == "__main__":
# yapf: disable
parser = argparse.ArgumentParser()
parser.add_argument("--train_path", default=None, type=str, help="The path of train set.")
parser.add_argument("--dev_path", default=None, type=str, help="The path of dev set.")
parser.add_argument("--batch_size", default=16, type=int, help="Batch size per GPU/CPU for training.")
parser.add_argument("--learning_rate", default=3e-5, type=float, help="The initial learning rate for Adam.")
parser.add_argument("--save_dir", default='./checkpoint', type=str, help="The output directory where the model checkpoints will be written.")
parser.add_argument("--max_seq_len", default=256, type=int, help="The maximum input sequence length.")
parser.add_argument("--label_maps_path", default="./ner_data/label_maps.json", type=str, help="The file path of the labels dictionary.")
parser.add_argument("--weight_decay", default=0.0, type=float, help="Weight decay rate for L2 regularizer.")
parser.add_argument("--warmup_proportion", default=0.0, type=float, help="Linear warmup proportion over the training process.")
    parser.add_argument("--num_epochs", default=100, type=int, help="Number of epochs for training.")
parser.add_argument("--seed", default=1000, type=int, help="Random seed for initialization")
parser.add_argument("--encoder", default="ernie-3.0-mini-zh", type=str, help="Select the pretrained encoder model for GP.")
parser.add_argument("--task_type", choices=['relation_extraction', 'event_extraction', 'entity_extraction', 'opinion_extraction'], default="entity_extraction", type=str, help="Select the training task type.")
    parser.add_argument("--logging_steps", default=10, type=int, help="The interval steps for logging.")
parser.add_argument("--eval_steps", default=200, type=int, help="The interval steps to evaluate model performance.")
    parser.add_argument('--device', choices=['cpu', 'gpu'], default="gpu", help="Select the device for training; defaults to gpu.")
parser.add_argument("--init_from_ckpt", default=None, type=str, help="The path of model parameters for initialization.")
args = parser.parse_args()
# yapf: enable
do_train()
# coding=utf-8
# Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
import copy
import json
import os
import random
import numpy as np
import paddle
from data_collator import DataCollator
from paddlenlp.taskflow.utils import SchemaTree
from paddlenlp.utils.log import logger
criteria_map = {
"entity_extraction": "entity_f1",
"opinion_extraction": "relation_f1", # (Aspect, Sentiment, Opinion)
"relation_extraction": "relation_f1", # (Subject, Predicate, Object)
"event_extraction": "relation_f1", # (Trigger, Role, Argument)
}
def set_seed(seed):
paddle.seed(seed)
random.seed(seed)
np.random.seed(seed)
def reader(data_path):
with open(data_path, "r", encoding="utf-8") as f:
for line in f:
json_line = json.loads(line)
yield json_line
def save_model_config(save_dir, model_config):
model_config_file = os.path.join(save_dir, "model_config.json")
with open(model_config_file, "w", encoding="utf-8") as fp:
fp.write(json.dumps(model_config, ensure_ascii=False, indent=2))
def map_offset(ori_offset, offset_mapping):
    """
    Map a character-level offset in the original text to a token index.
    """
for index, span in enumerate(offset_mapping):
if span[0] <= ori_offset < span[1]:
return index
return -1
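`map_offset` is the bridge between character-level annotation offsets and token indices. A quick self-contained sanity check, restating the function so the snippet runs on its own; the `offset_mapping` is fabricated in the shape produced by a tokenizer's `return_offsets_mapping=True`:

```python
def map_offset(ori_offset, offset_mapping):
    # Same logic as map_offset above: find the token whose character
    # span contains the original character offset.
    for index, span in enumerate(offset_mapping):
        if span[0] <= ori_offset < span[1]:
            return index
    return -1

# Made-up offset mapping: a (0, 0) special token, then tokens covering
# characters [0, 2), [2, 4) and [4, 7) of the raw text.
offset_mapping = [(0, 0), (0, 2), (2, 4), (4, 7)]
tok = map_offset(3, offset_mapping)      # char 3 lies in span (2, 4) -> token 2
missing = map_offset(9, offset_mapping)  # outside every span -> -1
```

Note that the `(0, 0)` special-token span can never contain an offset, so labels are never aligned onto `[CLS]`-like tokens.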
def get_label_maps(task_type="relation_extraction", label_maps_path=None):
with open(label_maps_path, "r", encoding="utf-8") as fp:
label_maps = json.load(fp)
if task_type == "entity_extraction":
entity2id = label_maps["entity2id"]
id2entity = {idx: t for t, idx in entity2id.items()}
label_maps["id2entity"] = id2entity
else:
entity2id = label_maps["entity2id"]
relation2id = (
label_maps["relation2id"]
if task_type in ["relation_extraction", "event_extraction"]
else label_maps["sentiment2id"]
)
id2entity = {idx: t for t, idx in entity2id.items()}
id2relation = {idx: t for t, idx in relation2id.items()}
label_maps["id2entity"] = id2entity
label_maps["id2relation"] = id2relation
return label_maps
def create_dataloader(
dataset, tokenizer, max_seq_len=128, batch_size=1, label_maps=None, mode="train", task_type="relation_extraction"
):
def tokenize_and_align_train_labels(example):
tokenized_inputs = tokenizer(
example["text"],
max_length=max_seq_len,
padding=False,
truncation=True,
return_attention_mask=True,
return_token_type_ids=False,
return_offsets_mapping=True,
)
offset_mapping = tokenized_inputs["offset_mapping"]
ent_labels = []
for e in example["entity_list"]:
_start, _end = e["start_index"], e["start_index"] + len(e["text"]) - 1
start = map_offset(_start, offset_mapping)
end = map_offset(_end, offset_mapping)
if start == -1 or end == -1:
continue
label = label_maps["entity2id"][e["type"]]
ent_labels.append([label, start, end])
outputs = {
"input_ids": tokenized_inputs["input_ids"],
"attention_mask": tokenized_inputs["attention_mask"],
"labels": {"ent_labels": ent_labels, "rel_labels": []},
}
if task_type in ["relation_extraction", "event_extraction"]:
rel_labels = []
for r in example["spo_list"]:
_sh, _oh = r["subject_start_index"], r["object_start_index"]
_st, _ot = _sh + len(r["subject"]) - 1, _oh + len(r["object"]) - 1
sh = map_offset(_sh, offset_mapping)
st = map_offset(_st, offset_mapping)
oh = map_offset(_oh, offset_mapping)
ot = map_offset(_ot, offset_mapping)
if sh == -1 or st == -1 or oh == -1 or ot == -1:
continue
p = label_maps["relation2id"][r["predicate"]]
rel_labels.append([sh, st, p, oh, ot])
outputs["labels"]["rel_labels"] = rel_labels
elif task_type == "opinion_extraction":
rel_labels = []
for r in example["aso_list"]:
_ah, _oh = r["aspect_start_index"], r["opinion_start_index"]
_at, _ot = _ah + len(r["aspect"]) - 1, _oh + len(r["opinion"]) - 1
ah = map_offset(_ah, offset_mapping)
at = map_offset(_at, offset_mapping)
oh = map_offset(_oh, offset_mapping)
ot = map_offset(_ot, offset_mapping)
if ah == -1 or at == -1 or oh == -1 or ot == -1:
continue
s = label_maps["sentiment2id"][r["sentiment"]]
rel_labels.append([ah, at, s, oh, ot])
outputs["labels"]["rel_labels"] = rel_labels
return outputs
def tokenize(example):
tokenized_inputs = tokenizer(
example["text"],
max_length=max_seq_len,
padding=False,
truncation=True,
return_attention_mask=True,
return_offsets_mapping=True,
return_token_type_ids=False,
)
tokenized_inputs["text"] = example["text"]
return tokenized_inputs
if mode == "train":
dataset = dataset.map(tokenize_and_align_train_labels)
else:
dataset_copy = copy.deepcopy(dataset)
dataset = dataset.map(tokenize)
data_collator = DataCollator(tokenizer, label_maps=label_maps, task_type=task_type)
shuffle = True if mode == "train" else False
batch_sampler = paddle.io.BatchSampler(dataset=dataset, batch_size=batch_size, shuffle=shuffle)
dataloader = paddle.io.DataLoader(
dataset=dataset, batch_sampler=batch_sampler, collate_fn=data_collator, num_workers=0, return_list=True
)
if mode != "train":
dataloader.dataset.raw_data = dataset_copy
return dataloader
def postprocess(batch_outputs, offset_mappings, texts, label_maps, task_type="relation_extraction"):
if task_type == "entity_extraction":
batch_ent_results = []
for entity_output, offset_mapping, text in zip(batch_outputs[0].numpy(), offset_mappings, texts):
entity_output[:, [0, -1]] -= np.inf
entity_output[:, :, [0, -1]] -= np.inf
ent_list = []
for l, start, end in zip(*np.where(entity_output > 0.0)):
start, end = (offset_mapping[start][0], offset_mapping[end][-1])
ent = {"text": text[start:end], "type": label_maps["id2entity"][l], "start_index": start}
ent_list.append(ent)
batch_ent_results.append(ent_list)
return batch_ent_results
else:
batch_ent_results = []
batch_rel_results = []
for entity_output, head_output, tail_output, offset_mapping, text in zip(
batch_outputs[0].numpy(),
batch_outputs[1].numpy(),
batch_outputs[2].numpy(),
offset_mappings,
texts,
):
entity_output[:, [0, -1]] -= np.inf
entity_output[:, :, [0, -1]] -= np.inf
ents = set()
ent_list = []
for l, start, end in zip(*np.where(entity_output > 0.0)):
ents.add((start, end))
start, end = (offset_mapping[start][0], offset_mapping[end][-1])
ent = {"text": text[start:end], "type": label_maps["id2entity"][l], "start_index": start}
ent_list.append(ent)
batch_ent_results.append(ent_list)
rel_list = []
for sh, st in ents:
for oh, ot in ents:
p1s = np.where(head_output[:, sh, oh] > 0.0)[0]
p2s = np.where(tail_output[:, st, ot] > 0.0)[0]
ps = set(p1s) & set(p2s)
for p in ps:
if task_type in ["relation_extraction", "event_extraction"]:
rel = {
"subject": text[offset_mapping[sh][0] : offset_mapping[st][1]],
"predicate": label_maps["id2relation"][p],
"object": text[offset_mapping[oh][0] : offset_mapping[ot][1]],
"subject_start_index": offset_mapping[sh][0],
"object_start_index": offset_mapping[oh][0],
}
else:
rel = {
"aspect": text[offset_mapping[sh][0] : offset_mapping[st][1]],
"sentiment": label_maps["id2relation"][p],
"opinion": text[offset_mapping[oh][0] : offset_mapping[ot][1]],
"aspect_start_index": offset_mapping[sh][0],
"opinion_start_index": offset_mapping[oh][0],
}
rel_list.append(rel)
batch_rel_results.append(rel_list)
return (batch_ent_results, batch_rel_results)
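The decoding in `postprocess` boils down to masking the special-token rows/columns and scanning the score tensor with `np.where`. A minimal sketch; the 1-type, 5-token score tensor below is fabricated for illustration:

```python
import numpy as np

# Fabricated GlobalPointer-style entity scores: (num_types, seq_len, seq_len).
# A positive score at [l, start, end] proposes a span of type l covering
# tokens start..end (inclusive).
entity_output = np.full((1, 5, 5), -1.0)
entity_output[0, 1, 2] = 3.5  # one proposed span: tokens 1..2, type 0
entity_output[0, 0, 3] = 2.0  # a span starting at the special token; masking removes it

# Mask spans that start or end on the special tokens (positions 0 and -1),
# mirroring the `-= np.inf` trick in postprocess above.
entity_output[:, [0, -1]] -= np.inf
entity_output[:, :, [0, -1]] -= np.inf

spans = [tuple(int(i) for i in t) for t in zip(*np.where(entity_output > 0.0))]
```

Only the span that touches neither special token survives the masking.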
def build_tree(schema, name="root"):
"""
Build the schema tree.
"""
schema_tree = SchemaTree(name)
for s in schema:
if isinstance(s, str):
schema_tree.add_child(SchemaTree(s))
elif isinstance(s, dict):
for k, v in s.items():
if isinstance(v, str):
child = [v]
elif isinstance(v, list):
child = v
else:
                    raise TypeError(
                        "Invalid schema: the value for each key should be a list or a string, "
                        "but {} was received.".format(type(v))
                    )
schema_tree.add_child(build_tree(child, name=k))
else:
            raise TypeError("Invalid schema: each element should be a string or a dict, but {} was received.".format(type(s)))
return schema_tree
def schema2label_maps(task_type, schema=None):
if schema and isinstance(schema, dict):
schema = [schema]
label_maps = {}
if task_type == "entity_extraction":
entity2id = {}
for s in schema:
entity2id[s] = len(entity2id)
label_maps["entity2id"] = entity2id
elif task_type == "opinion_extraction":
schema = ["观点词", {"评价维度": ["观点词", "情感倾向[正向,负向]"]}]
logger.info("Opinion extraction does not support custom schema, the schema is default to %s." % schema)
label_maps["entity2id"] = {"评价维度": 0, "观点词": 1}
label_maps["sentiment2id"] = {"正向": 0, "负向": 1}
else:
entity2id = {}
relation2id = {}
schema_tree = build_tree(schema)
schema_list = schema_tree.children[:]
while len(schema_list) > 0:
node = schema_list.pop(0)
if node.name not in entity2id.keys() and len(node.children) != 0:
entity2id[node.name] = len(entity2id)
for child in node.children:
if child.name not in relation2id.keys():
relation2id[child.name] = len(relation2id)
schema_list.append(child)
entity2id["object"] = len(entity2id)
label_maps["entity2id"] = entity2id
label_maps["relation2id"] = relation2id
label_maps["schema"] = schema
return label_maps
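For a one-level relation schema such as `{"武器名称": ["产国", "类型", "研发单位"]}`, the traversal above assigns an entity label to each subject type plus a shared `"object"` type, and a relation label to each predicate. A dependency-free sketch of that bookkeeping (a simplified stand-in for illustration, not the `SchemaTree`-based implementation):

```python
def mini_label_maps(schema):
    # schema: {subject_type: [predicate, ...]} -- one-level relation schema only.
    entity2id, relation2id = {}, {}
    for subject_type, predicates in schema.items():
        entity2id.setdefault(subject_type, len(entity2id))
        for p in predicates:
            relation2id.setdefault(p, len(relation2id))
    entity2id["object"] = len(entity2id)  # all objects share one catch-all type
    return {"entity2id": entity2id, "relation2id": relation2id}

maps = mini_label_maps({"武器名称": ["产国", "类型", "研发单位"]})
```

The catch-all `"object"` entity type is why `label_studio2distill` and `doccano2distill` map unknown entity labels to `"object"` in the relation-extraction branches.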
def anno2distill(json_lines, task_type, label_maps=None, platform="label_studio"):
if platform == "label_studio":
return label_studio2distill(json_lines, task_type, label_maps)
else:
return doccano2distill(json_lines, task_type, label_maps)
def label_studio2distill(json_lines, task_type, label_maps=None):
"""Convert label-studio to distill format"""
if task_type == "opinion_extraction":
outputs = []
for json_line in json_lines:
id2ent = {}
text = json_line["data"]["text"]
output = {"text": text}
entity_list = []
aso_list = []
annos = json_line["annotations"][0]["result"]
for anno in annos:
if anno["type"] == "labels":
ent_text = text[anno["value"]["start"] : anno["value"]["end"]]
ent_type_gather = anno["value"]["labels"][0].split("##")
if len(ent_type_gather) == 2:
ent_type, ent_senti = ent_type_gather
else:
ent_type = ent_type_gather[0]
ent_senti = None
ent = {"text": ent_text, "type": ent_type, "start_index": anno["value"]["start"]}
id2ent[anno["id"]] = ent
id2ent[anno["id"]]["sentiment"] = ent_senti
entity_list.append(ent)
else:
_aspect = id2ent[anno["from_id"]]
if _aspect["sentiment"]:
_opinion = id2ent[anno["to_id"]]
rel = {
"aspect": _aspect["text"],
"sentiment": _aspect["sentiment"],
"opinion": _opinion["text"],
"aspect_start_index": _aspect["start_index"],
"opinion_start_index": _opinion["start_index"],
}
aso_list.append(rel)
output["entity_list"] = entity_list
output["aso_list"] = aso_list
outputs.append(output)
else:
outputs = []
for json_line in json_lines:
id2ent = {}
text = json_line["data"]["text"]
output = {"text": text}
entity_list = []
spo_list = []
annos = json_line["annotations"][0]["result"]
for anno in annos:
if anno["type"] == "labels":
ent_text = text[anno["value"]["start"] : anno["value"]["end"]]
ent_label = anno["value"]["labels"][0]
ent_type = "object" if ent_label not in label_maps["entity2id"].keys() else ent_label
ent = {"text": ent_text, "type": ent_type, "start_index": anno["value"]["start"]}
id2ent[anno["id"]] = ent
entity_list.append(ent)
else:
_subject = id2ent[anno["from_id"]]
_object = id2ent[anno["to_id"]]
rel = {
"subject": _subject["text"],
"predicate": anno["labels"][0],
"object": _object["text"],
"subject_start_index": _subject["start_index"],
"object_start_index": _object["start_index"],
}
spo_list.append(rel)
output["entity_list"] = entity_list
output["spo_list"] = spo_list
outputs.append(output)
return outputs
def doccano2distill(json_lines, task_type, label_maps=None):
"""Convert doccano to distill format"""
if task_type == "opinion_extraction":
outputs = []
for json_line in json_lines:
id2ent = {}
text = json_line["text"]
output = {"text": text}
entity_list = []
entities = json_line["entities"]
for entity in entities:
ent_text = text[entity["start_offset"] : entity["end_offset"]]
ent_type_gather = entity["label"].split("##")
if len(ent_type_gather) == 2:
ent_type, ent_senti = ent_type_gather
else:
ent_type = ent_type_gather[0]
ent_senti = None
ent = {"text": ent_text, "type": ent_type, "start_index": entity["start_offset"]}
id2ent[entity["id"]] = ent
id2ent[entity["id"]]["sentiment"] = ent_senti
entity_list.append(ent)
output["entity_list"] = entity_list
aso_list = []
relations = json_line["relations"]
for relation in relations:
_aspect = id2ent[relation["from_id"]]
if _aspect["sentiment"]:
_opinion = id2ent[relation["to_id"]]
rel = {
"aspect": _aspect["text"],
"sentiment": _aspect["sentiment"],
"opinion": _opinion["text"],
"aspect_start_index": _aspect["start_index"],
"opinion_start_index": _opinion["start_index"],
}
aso_list.append(rel)
output["aso_list"] = aso_list
outputs.append(output)
else:
outputs = []
for json_line in json_lines:
id2ent = {}
text = json_line["text"]
output = {"text": text}
entity_list = []
entities = json_line["entities"]
for entity in entities:
ent_text = text[entity["start_offset"] : entity["end_offset"]]
if entity["label"] not in label_maps["entity2id"].keys():
if task_type == "entity_extraction":
                        logger.warning(
                            "Found an undefined label type. The schema should contain all label types present in the annotation file exported from the annotation platform."
                        )
continue
else:
ent_type = "object"
else:
ent_type = entity["label"]
ent = {"text": ent_text, "type": ent_type, "start_index": entity["start_offset"]}
id2ent[entity["id"]] = ent
entity_list.append(ent)
output["entity_list"] = entity_list
spo_list = []
relations = json_line["relations"]
for relation in relations:
_subject = id2ent[relation["from_id"]]
_object = id2ent[relation["to_id"]]
rel = {
"subject": _subject["text"],
"predicate": relation["type"],
"object": _object["text"],
"subject_start_index": _subject["start_index"],
"object_start_index": _object["start_index"],
}
spo_list.append(rel)
output["spo_list"] = spo_list
outputs.append(output)
return outputs
def synthetic2distill(texts, infer_results, task_type, label_maps=None):
"""Convert synthetic data to distill format"""
if task_type == "opinion_extraction":
outputs = []
for i, line in enumerate(infer_results):
pred = line
output = {"text": texts[i]}
entity_list = []
aso_list = []
for key1 in pred.keys():
for s in pred[key1]:
ent = {"text": s["text"], "type": key1, "start_index": s["start"]}
entity_list.append(ent)
if (
"relations" in s.keys()
and "观点词" in s["relations"].keys()
and "情感倾向[正向,负向]" in s["relations"].keys()
):
for o in s["relations"]["观点词"]:
rel = {
"aspect": s["text"],
"sentiment": s["relations"]["情感倾向[正向,负向]"][0]["text"],
"opinion": o["text"],
"aspect_start_index": s["start"],
"opinion_start_index": o["start"],
}
aso_list.append(rel)
ent = {"text": o["text"], "type": "观点词", "start_index": o["start"]}
entity_list.append(ent)
output["entity_list"] = entity_list
output["aso_list"] = aso_list
outputs.append(output)
else:
outputs = []
for i, line in enumerate(infer_results):
pred = line
output = {"text": texts[i]}
entity_list = []
spo_list = []
for key1 in pred.keys():
for s in pred[key1]:
ent = {"text": s["text"], "type": key1, "start_index": s["start"]}
entity_list.append(ent)
if "relations" in s.keys():
for key2 in s["relations"].keys():
for o1 in s["relations"][key2]:
if "start" in o1.keys():
rel = {
"subject": s["text"],
"predicate": key2,
"object": o1["text"],
"subject_start_index": s["start"],
"object_start_index": o1["start"],
}
spo_list.append(rel)
if "relations" not in o1.keys():
ent = {"text": o1["text"], "type": "object", "start_index": o1["start"]}
entity_list.append(ent)
else:
ent = {"text": o1["text"], "type": key2, "start_index": o1["start"]}
entity_list.append(ent)
for key3 in o1["relations"].keys():
for o2 in o1["relations"][key3]:
ent = {
"text": o2["text"],
"type": "object",
"start_index": o2["start"],
}
entity_list.append(ent)
rel = {
"subject": o1["text"],
"predicate": key3,
"object": o2["text"],
"subject_start_index": o1["start"],
"object_start_index": o2["start"],
}
spo_list.append(rel)
output["entity_list"] = entity_list
output["spo_list"] = spo_list
outputs.append(output)
return outputs
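The heart of `synthetic2distill` is flattening Taskflow's nested `{type: [{"text", "start", "relations": ...}]}` predictions into `entity_list`/`spo_list` records. A simplified, self-contained sketch handling a single level of relations; the input below is a fabricated Taskflow-style prediction, not real model output:

```python
def mini_distill(text, pred):
    # Flatten one Taskflow-style prediction into entity_list / spo_list records.
    entity_list, spo_list = [], []
    for ent_type, spans in pred.items():
        for s in spans:
            entity_list.append({"text": s["text"], "type": ent_type, "start_index": s["start"]})
            for predicate, objs in s.get("relations", {}).items():
                for o in objs:
                    # Objects without further relations get the catch-all type.
                    entity_list.append({"text": o["text"], "type": "object", "start_index": o["start"]})
                    spo_list.append(
                        {
                            "subject": s["text"],
                            "predicate": predicate,
                            "object": o["text"],
                            "subject_start_index": s["start"],
                            "object_start_index": o["start"],
                        }
                    )
    return {"text": text, "entity_list": entity_list, "spo_list": spo_list}

# Fabricated prediction, for illustration only.
out = mini_distill(
    "威尔哥减速炸弹由瑞典FFV军械公司研制",
    {"武器名称": [{"text": "威尔哥减速炸弹", "start": 0,
                   "relations": {"研发单位": [{"text": "瑞典FFV军械公司", "start": 8}]}}]},
)
```

The full implementation additionally recurses one more relation level and handles the opinion-extraction layout.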
# Service deployment based on PaddleNLP SimpleServing
## Contents
- [Environment Preparation](#environment-preparation)
- [Starting the Server](#starting-the-server)
- [Starting the Client Request](#starting-the-client-request)
- [Custom Service Parameters](#custom-service-parameters)
## Environment Preparation
Use a PaddleNLP version that includes SimpleServing (or the latest develop version):
```shell
pip install "paddlenlp>=2.4.4"
```
## Starting the Server
```bash
paddlenlp server server:app --workers 1 --host 0.0.0.0 --port 8189
```
## Starting the Client Request
```bash
python client.py
```
## Custom Service Parameters
### Server custom parameters
#### Schema replacement
```python
# Default schema
schema = {"武器名称": ["产国", "类型", "研发单位"]}
```
#### Setting the model path
```python
# Default task_path
uie = Taskflow('information_extraction', task_path='../../checkpoint/model_best/', schema=schema)
```
#### Multi-card service prediction
PaddleNLP SimpleServing supports load-balanced prediction across multiple cards: register two Taskflow tasks when registering the service, as in the sample code below.
```python
uie1 = Taskflow('information_extraction', task_path='../../checkpoint/model_best/', schema=schema, device_id=0)
uie2 = Taskflow('information_extraction', task_path='../../checkpoint/model_best/', schema=schema, device_id=1)
service.register_taskflow('uie', [uie1, uie2])
```
### Client custom parameters
```python
# Change to the input texts you want
texts = ['威尔哥(Virgo)减速炸弹是由瑞典FFV军械公司专门为瑞典皇家空军的攻击机实施低空高速轰炸而研制,1956年开始研制,1963年进入服役,装备于A32“矛盾”、A35“龙”、和AJ134“雷”攻击机,主要用于攻击登陆艇、停放的飞机、高炮、野战火炮、轻型防护装甲车辆以及有生力量。']
```
# Service deployment based on PaddleNLP SimpleServing
- [Environment Preparation](#1)
- [Server](#2)
- [Client](#3)
- [Service Custom Parameters](#4)
<a name="1"></a>
## Environment Preparation
Use a PaddleNLP version that includes SimpleServing (or the latest develop version):
```shell
pip install "paddlenlp>=2.4.4"
```
<a name="2"></a>
## Server
```bash
paddlenlp server server:app --workers 1 --host 0.0.0.0 --port 8189
```
<a name="3"></a>
## Client
```bash
python client.py
```
<a name="4"></a>
## Service Custom Parameters
### Server Custom Parameters
#### Schema replacement
```python
# Default schema
schema = {"Weapon Name": ["Country of Production", "Type", "R&D Unit"]}
```
#### Setting the model path
```python
# Default task_path
uie = Taskflow('information_extraction', task_path='../../checkpoint/model_best/', schema=schema)
```
#### Multi-card service prediction
PaddleNLP SimpleServing supports load-balanced prediction across multiple cards: register two Taskflow tasks when registering the service, as in the sample code below.
```python
uie1 = Taskflow('information_extraction', task_path='../../checkpoint/model_best/', schema=schema, device_id=0)
uie2 = Taskflow('information_extraction', task_path='../../checkpoint/model_best/', schema=schema, device_id=1)
service.register_taskflow('uie', [uie1, uie2])
```
### Client Custom Parameters
```python
# Change to the input texts you want
texts = ['威尔哥(Virgo)减速炸弹是由瑞典FFV军械公司专门为瑞典皇家空军的攻击机实施低空高速轰炸而研制,1956年开始研制,1963年进入服役,装备于A32“矛盾”、A35“龙”、和AJ134“雷”攻击机,主要用于攻击登陆艇、停放的飞机、高炮、野战火炮、轻型防护装甲车辆以及有生力量。']
```
# Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
import json
import requests
url = "http://0.0.0.0:8189/taskflow/uie"
headers = {"Content-Type": "application/json"}
texts = [
    "威尔哥(Virgo)减速炸弹是由瑞典FFV军械公司专门为瑞典皇家空军的攻击机实施低空高速轰炸而研制,1956年开始研制,1963年进入服役,装备于A32“矛盾”、A35“龙”、和AJ134“雷”攻击机,主要用于攻击登陆艇、停放的飞机、高炮、野战火炮、轻型防护装甲车辆以及有生力量。"
]
data = {"data": {"text": texts}}
r = requests.post(url=url, headers=headers, data=json.dumps(data))
datas = json.loads(r.text)
print(datas)
# Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
from paddlenlp import SimpleServer, Taskflow
# The schema changed to your defined schema
schema = {"武器名称": ["产国", "类型", "研发单位"]}
# The task path changed to your best model path
uie = Taskflow("information_extraction", schema=schema, task_path="../../checkpoint/model_best/")
# If you want to define the finetuned uie service
app = SimpleServer()
app.register_taskflow("taskflow/uie", uie)
# Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
import argparse
from functools import partial
import paddle
from utils import convert_example, create_data_loader, reader
from paddlenlp.data import DataCollatorWithPadding
from paddlenlp.datasets import MapDataset, load_dataset
from paddlenlp.metrics import SpanEvaluator
from paddlenlp.transformers import UIE, UIEM, AutoTokenizer
from paddlenlp.utils.ie_utils import get_relation_type_dict, unify_prompt_name
from paddlenlp.utils.log import logger
@paddle.no_grad()
def evaluate(model, metric, data_loader, multilingual=False):
    """
    Given a dataset, evaluate the model and compute the metric.
    Args:
        model(obj:`paddle.nn.Layer`): A model to classify texts.
        metric(obj:`paddle.metric.Metric`): The evaluation metric.
        data_loader(obj:`paddle.io.DataLoader`): The dataset loader which generates batches.
        multilingual(bool): Whether the model is multilingual.
    """
model.eval()
metric.reset()
for batch in data_loader:
if multilingual:
start_prob, end_prob = model(batch["input_ids"], batch["position_ids"])
else:
start_prob, end_prob = model(
batch["input_ids"], batch["token_type_ids"], batch["position_ids"], batch["attention_mask"]
)
start_ids = paddle.cast(batch["start_positions"], "float32")
end_ids = paddle.cast(batch["end_positions"], "float32")
num_correct, num_infer, num_label = metric.compute(start_prob, end_prob, start_ids, end_ids)
metric.update(num_correct, num_infer, num_label)
precision, recall, f1 = metric.accumulate()
model.train()
return precision, recall, f1
def do_eval():
paddle.set_device(args.device)
if args.model_path in ["uie-m-base", "uie-m-large"]:
args.multilingual = True
tokenizer = AutoTokenizer.from_pretrained(args.model_path)
if args.multilingual:
model = UIEM.from_pretrained(args.model_path)
else:
model = UIE.from_pretrained(args.model_path)
test_ds = load_dataset(reader, data_path=args.test_path, max_seq_len=args.max_seq_len, lazy=False)
class_dict = {}
relation_data = []
if args.debug:
for data in test_ds:
class_name = unify_prompt_name(data["prompt"])
# Only positive examples are evaluated in debug mode
if len(data["result_list"]) != 0:
p = "的" if args.schema_lang == "ch" else " of "
if p not in data["prompt"]:
class_dict.setdefault(class_name, []).append(data)
else:
relation_data.append((data["prompt"], data))
relation_type_dict = get_relation_type_dict(relation_data, schema_lang=args.schema_lang)
else:
class_dict["all_classes"] = test_ds
trans_fn = partial(
convert_example, tokenizer=tokenizer, max_seq_len=args.max_seq_len, multilingual=args.multilingual
)
for key in class_dict.keys():
if args.debug:
test_ds = MapDataset(class_dict[key])
else:
test_ds = class_dict[key]
test_ds = test_ds.map(trans_fn)
data_collator = DataCollatorWithPadding(tokenizer)
test_data_loader = create_data_loader(test_ds, mode="test", batch_size=args.batch_size, trans_fn=data_collator)
metric = SpanEvaluator()
precision, recall, f1 = evaluate(model, metric, test_data_loader, args.multilingual)
logger.info("-----------------------------")
logger.info("Class Name: %s" % key)
logger.info("Evaluation Precision: %.5f | Recall: %.5f | F1: %.5f" % (precision, recall, f1))
if args.debug and len(relation_type_dict.keys()) != 0:
for key in relation_type_dict.keys():
test_ds = MapDataset(relation_type_dict[key])
test_ds = test_ds.map(trans_fn)
test_data_loader = create_data_loader(
test_ds, mode="test", batch_size=args.batch_size, trans_fn=data_collator
)
metric = SpanEvaluator()
precision, recall, f1 = evaluate(model, metric, test_data_loader, args.multilingual)
logger.info("-----------------------------")
if args.schema_lang == "ch":
logger.info("Class Name: X的%s" % key)
else:
logger.info("Class Name: %s of X" % key)
logger.info("Evaluation Precision: %.5f | Recall: %.5f | F1: %.5f" % (precision, recall, f1))
if __name__ == "__main__":
# yapf: disable
parser = argparse.ArgumentParser()
parser.add_argument("--model_path", type=str, default=None, help="The path of saved model that you want to load.")
parser.add_argument("--test_path", type=str, default=None, help="The path of test set.")
parser.add_argument("--batch_size", type=int, default=16, help="Batch size per GPU/CPU for training.")
parser.add_argument("--device", type=str, default="gpu", choices=["gpu", "cpu", "npu"], help="Device selected for evaluate.")
parser.add_argument("--max_seq_len", type=int, default=512, help="The maximum total input sequence length after tokenization.")
parser.add_argument("--debug", action='store_true', help="Precision, recall and F1 score are calculated for each class separately if this option is enabled.")
parser.add_argument("--multilingual", action='store_true', help="Whether is the multilingual model.")
parser.add_argument("--schema_lang", choices=["ch", "en"], default="ch", help="Select the language type for schema.")
args = parser.parse_args()
# yapf: enable
do_eval()
# Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
import json
import os
from dataclasses import dataclass, field
from functools import partial
from typing import List, Optional
import paddle
from utils import convert_example, reader
from paddlenlp.data import DataCollatorWithPadding
from paddlenlp.datasets import load_dataset
from paddlenlp.metrics import SpanEvaluator
from paddlenlp.trainer import (
CompressionArguments,
PdArgumentParser,
Trainer,
get_last_checkpoint,
)
from paddlenlp.transformers import UIE, UIEM, AutoTokenizer, export_model
from paddlenlp.utils.ie_utils import compute_metrics, uie_loss_func
from paddlenlp.utils.log import logger
@dataclass
class DataArguments:
"""
Arguments pertaining to what data we are going to input our model for training and eval.
Using `PdArgumentParser` we can turn this class into argparse arguments to be able to
specify them on the command line.
"""
train_path: str = field(
default=None, metadata={"help": "The path of the training set."}
)
dev_path: str = field(
default=None, metadata={"help": "The path of the development set."}
)
max_seq_length: Optional[int] = field(
default=512,
metadata={
"help": "The maximum total input sequence length after tokenization. Sequences longer "
"than this will be truncated, sequences shorter will be padded."
},
)
dynamic_max_length: Optional[List[int]] = field(
default=None,
metadata={"help": "dynamic max length from batch, it can be array of length, eg: 16 32 64 128"},
)
@dataclass
class ModelArguments:
"""
Arguments pertaining to which model/config/tokenizer we are going to fine-tune from.
"""
model_name_or_path: Optional[str] = field(
default="uie-base",
metadata={
"help": "Path to pretrained model, such as 'uie-base', 'uie-tiny', "
"'uie-medium', 'uie-mini', 'uie-micro', 'uie-nano', 'uie-base-en', "
"'uie-m-base', 'uie-m-large', or finetuned model path."
},
)
export_model_dir: Optional[str] = field(
default=None,
metadata={"help": "Path to directory to store the exported inference model."},
)
multilingual: bool = field(default=False, metadata={"help": "Whether the model is a multilingual model."})
def main():
parser = PdArgumentParser((ModelArguments, DataArguments, CompressionArguments))
model_args, data_args, training_args = parser.parse_args_into_dataclasses()
training_args.label_names = ["start_positions", "end_positions"]
if model_args.model_name_or_path in ["uie-m-base", "uie-m-large"]:
model_args.multilingual = True
elif os.path.exists(os.path.join(model_args.model_name_or_path, "model_config.json")):
with open(os.path.join(model_args.model_name_or_path, "model_config.json")) as f:
init_class = json.load(f)["init_class"]
if init_class == "UIEM":
model_args.multilingual = True
# Log model and data config
training_args.print_config(model_args, "Model")
training_args.print_config(data_args, "Data")
paddle.set_device(training_args.device)
# Log on each process the small summary:
logger.warning(
f"Process rank: {training_args.local_rank}, device: {training_args.device}, world_size: {training_args.world_size}, "
+ f"distributed training: {bool(training_args.local_rank != -1)}, 16-bits training: {training_args.fp16}"
)
# Detecting last checkpoint.
last_checkpoint = None
if os.path.isdir(training_args.output_dir) and training_args.do_train and not training_args.overwrite_output_dir:
last_checkpoint = get_last_checkpoint(training_args.output_dir)
if last_checkpoint is None and len(os.listdir(training_args.output_dir)) > 0:
raise ValueError(
f"Output directory ({training_args.output_dir}) already exists and is not empty. "
"Use --overwrite_output_dir to overcome."
)
elif last_checkpoint is not None and training_args.resume_from_checkpoint is None:
logger.info(
f"Checkpoint detected, resuming training at {last_checkpoint}. To avoid this behavior, change "
"the `--output_dir` or add `--overwrite_output_dir` to train from scratch."
)
tokenizer = AutoTokenizer.from_pretrained(model_args.model_name_or_path)
if model_args.multilingual:
model = UIEM.from_pretrained(model_args.model_name_or_path)
else:
model = UIE.from_pretrained(model_args.model_name_or_path)
train_ds = load_dataset(reader, data_path=data_args.train_path, max_seq_len=data_args.max_seq_length, lazy=False)
dev_ds = load_dataset(reader, data_path=data_args.dev_path, max_seq_len=data_args.max_seq_length, lazy=False)
trans_fn = partial(
convert_example,
tokenizer=tokenizer,
max_seq_len=data_args.max_seq_length,
multilingual=model_args.multilingual,
dynamic_max_length=data_args.dynamic_max_length,
)
train_ds = train_ds.map(trans_fn)
dev_ds = dev_ds.map(trans_fn)
if training_args.device == "npu":
data_collator = DataCollatorWithPadding(tokenizer, padding="longest")
else:
data_collator = DataCollatorWithPadding(tokenizer)
trainer = Trainer(
model=model,
criterion=uie_loss_func,
args=training_args,
data_collator=data_collator,
train_dataset=train_ds if training_args.do_train or training_args.do_compress else None,
eval_dataset=dev_ds if training_args.do_eval or training_args.do_compress else None,
tokenizer=tokenizer,
compute_metrics=compute_metrics,
)
trainer.optimizer = paddle.optimizer.AdamW(
learning_rate=training_args.learning_rate, parameters=model.parameters()
)
checkpoint = None
if training_args.resume_from_checkpoint is not None:
checkpoint = training_args.resume_from_checkpoint
elif last_checkpoint is not None:
checkpoint = last_checkpoint
# Training
if training_args.do_train:
train_result = trainer.train(resume_from_checkpoint=checkpoint)
metrics = train_result.metrics
trainer.save_model()
trainer.log_metrics("train", metrics)
trainer.save_metrics("train", metrics)
trainer.save_state()
# Evaluate the model
if training_args.do_eval:
eval_metrics = trainer.evaluate()
trainer.log_metrics("eval", eval_metrics)
# export inference model
if training_args.do_export:
# You can also load from certain checkpoint
# trainer.load_state_dict_from_checkpoint("/path/to/checkpoint/")
if training_args.device == "npu":
# NPU casts int64 to int32 for internal computation; feeding int32
# inputs avoids the redundant cast.
input_spec_dtype = "int32"
else:
input_spec_dtype = "int64"
if model_args.multilingual:
input_spec = [
paddle.static.InputSpec(shape=[None, None], dtype=input_spec_dtype, name="input_ids"),
paddle.static.InputSpec(shape=[None, None], dtype=input_spec_dtype, name="position_ids"),
]
else:
input_spec = [
paddle.static.InputSpec(shape=[None, None], dtype=input_spec_dtype, name="input_ids"),
paddle.static.InputSpec(shape=[None, None], dtype=input_spec_dtype, name="token_type_ids"),
paddle.static.InputSpec(shape=[None, None], dtype=input_spec_dtype, name="position_ids"),
paddle.static.InputSpec(shape=[None, None], dtype=input_spec_dtype, name="attention_mask"),
]
if model_args.export_model_dir is None:
model_args.export_model_dir = os.path.join(training_args.output_dir, "export")
export_model(model=trainer.model, input_spec=input_spec, path=model_args.export_model_dir)
if training_args.do_compress:
@paddle.no_grad()
def custom_evaluate(self, model, data_loader):
metric = SpanEvaluator()
model.eval()
metric.reset()
for batch in data_loader:
if model_args.multilingual:
logits = model(input_ids=batch["input_ids"], position_ids=batch["position_ids"])
else:
logits = model(
input_ids=batch["input_ids"],
token_type_ids=batch["token_type_ids"],
position_ids=batch["position_ids"],
attention_mask=batch["attention_mask"],
)
start_prob, end_prob = logits
start_ids, end_ids = batch["start_positions"], batch["end_positions"]
num_correct, num_infer, num_label = metric.compute(start_prob, end_prob, start_ids, end_ids)
metric.update(num_correct, num_infer, num_label)
precision, recall, f1 = metric.accumulate()
logger.info("f1: %s, precision: %s, recall: %s" % (f1, precision, f1))
model.train()
return f1
trainer.compress(custom_evaluate=custom_evaluate)
if __name__ == "__main__":
main()
import json
import random
from typing import List, Optional
import numpy as np
import paddle
from paddlenlp.utils.log import logger
def set_seed(seed):
paddle.seed(seed)
random.seed(seed)
np.random.seed(seed)
def create_data_loader(dataset, mode="train", batch_size=1, trans_fn=None):
"""
Create dataloader.
Args:
dataset(obj:`paddle.io.Dataset`): Dataset instance.
mode(obj:`str`, optional, defaults to obj:`train`): If mode is 'train', it will shuffle the dataset randomly.
batch_size(obj:`int`, optional, defaults to 1): The sample number of a mini-batch.
trans_fn(obj:`callable`, optional, defaults to `None`): function to convert a data sample to input ids, etc.
Returns:
dataloader(obj:`paddle.io.DataLoader`): The dataloader which generates batches.
"""
if trans_fn:
dataset = dataset.map(trans_fn)
shuffle = True if mode == "train" else False
if mode == "train":
sampler = paddle.io.DistributedBatchSampler(dataset=dataset, batch_size=batch_size, shuffle=shuffle)
else:
sampler = paddle.io.BatchSampler(dataset=dataset, batch_size=batch_size, shuffle=shuffle)
dataloader = paddle.io.DataLoader(dataset, batch_sampler=sampler, return_list=True)
return dataloader
def map_offset(ori_offset, offset_mapping):
"""
map ori offset to token offset
"""
for index, span in enumerate(offset_mapping):
if span[0] <= ori_offset < span[1]:
return index
return -1
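# A standalone sketch of the span search that map_offset performs. The spans
# below are illustrative offset_mapping entries for three tokens; a character
# offset maps to the index of the token span containing it, or -1 if none does.
example_spans = [(0, 5), (5, 6), (6, 11)]
char_offset = 7
token_index = next((i for i, span in enumerate(example_spans) if span[0] <= char_offset < span[1]), -1)
assert token_index == 2  # offset 7 falls inside the third span (6, 11)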
def reader(data_path, max_seq_len=512):
"""
Read a JSONL dataset file; split examples whose content exceeds max_seq_len into multiple chunks.
"""
with open(data_path, "r", encoding="utf-8") as f:
for line in f:
json_line = json.loads(line)
content = json_line["content"].strip()
prompt = json_line["prompt"]
# Model input looks like: [CLS] Prompt [SEP] Content [SEP]
# It includes three special tokens.
if max_seq_len <= len(prompt) + 3:
raise ValueError("The value of max_seq_len is too small, please set a larger value")
max_content_len = max_seq_len - len(prompt) - 3
if len(content) <= max_content_len:
yield json_line
else:
result_list = json_line["result_list"]
json_lines = []
accumulate = 0
while True:
cur_result_list = []
for result in result_list:
if result["end"] - result["start"] > max_content_len:
logger.warning(
"result['end'] - result ['start'] exceeds max_content_len, which will result in no valid instance being returned"
)
if (
result["start"] + 1 <= max_content_len < result["end"]
and result["end"] - result["start"] <= max_content_len
):
max_content_len = result["start"]
break
cur_content = content[:max_content_len]
res_content = content[max_content_len:]
while True:
if len(result_list) == 0:
break
elif result_list[0]["end"] <= max_content_len:
if result_list[0]["end"] > 0:
cur_result = result_list.pop(0)
cur_result_list.append(cur_result)
else:
cur_result_list = [result for result in result_list]
break
else:
break
json_line = {"content": cur_content, "result_list": cur_result_list, "prompt": prompt}
json_lines.append(json_line)
for result in result_list:
if result["end"] <= 0:
break
result["start"] -= max_content_len
result["end"] -= max_content_len
accumulate += max_content_len
max_content_len = max_seq_len - len(prompt) - 3
if len(res_content) == 0:
break
elif len(res_content) < max_content_len:
json_line = {"content": res_content, "result_list": result_list, "prompt": prompt}
json_lines.append(json_line)
break
else:
content = res_content
for json_line in json_lines:
yield json_line
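# The reader above expects one JSON object per line (JSONL), with start/end as
# character offsets into content where end is exclusive. The field values in
# this minimal round-trip check are illustrative:
import json
sample = {"content": "Alice went home", "result_list": [{"text": "Alice", "start": 0, "end": 5}], "prompt": "person"}
parsed = json.loads(json.dumps(sample, ensure_ascii=False))
span = parsed["content"][parsed["result_list"][0]["start"]:parsed["result_list"][0]["end"]]
assert span == "Alice"  # end offset is exclusive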
def get_dynamic_max_length(examples, default_max_length: int, dynamic_max_length: List[int]) -> int:
"""get max_length by examples which you can change it by examples in batch"""
cur_length = len(examples[0]["input_ids"])
max_length = default_max_length
for max_length_option in sorted(dynamic_max_length):
if cur_length <= max_length_option:
max_length = max_length_option
break
return max_length
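# Standalone sketch of the bucket selection above: choose the smallest bucket
# that fits the current input length, falling back to the default otherwise.
# The bucket sizes and lengths here are illustrative.
buckets = [16, 32, 64, 128]
cur_length = 40
chosen = next((b for b in sorted(buckets) if cur_length <= b), 512)
assert chosen == 64  # 40 tokens fit in the 64 bucket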
def convert_example(
example, tokenizer, max_seq_len, multilingual=False, dynamic_max_length: Optional[List[int]] = None
):
"""
example: {
title
prompt
content
result_list
}
"""
if dynamic_max_length is not None:
temp_encoded_inputs = tokenizer(
text=[example["prompt"]],
text_pair=[example["content"]],
truncation=True,
max_seq_len=max_seq_len,
return_attention_mask=True,
return_position_ids=True,
return_dict=False,
return_offsets_mapping=True,
)
max_length = get_dynamic_max_length(
examples=temp_encoded_inputs, default_max_length=max_seq_len, dynamic_max_length=dynamic_max_length
)
# always pad to max_length
encoded_inputs = tokenizer(
text=[example["prompt"]],
text_pair=[example["content"]],
truncation=True,
max_seq_len=max_length,
pad_to_max_seq_len=True,
return_attention_mask=True,
return_position_ids=True,
return_dict=False,
return_offsets_mapping=True,
)
start_ids = [0.0 for x in range(max_length)]
end_ids = [0.0 for x in range(max_length)]
else:
encoded_inputs = tokenizer(
text=[example["prompt"]],
text_pair=[example["content"]],
truncation=True,
max_seq_len=max_seq_len,
pad_to_max_seq_len=True,
return_attention_mask=True,
return_position_ids=True,
return_dict=False,
return_offsets_mapping=True,
)
start_ids = [0.0 for x in range(max_seq_len)]
end_ids = [0.0 for x in range(max_seq_len)]
encoded_inputs = encoded_inputs[0]
offset_mapping = [list(x) for x in encoded_inputs["offset_mapping"]]
bias = 0
for index in range(1, len(offset_mapping)):
mapping = offset_mapping[index]
if mapping[0] == 0 and mapping[1] == 0 and bias == 0:
bias = offset_mapping[index - 1][1] + 1 # Includes [SEP] token
if mapping[0] == 0 and mapping[1] == 0:
continue
offset_mapping[index][0] += bias
offset_mapping[index][1] += bias
for item in example["result_list"]:
start = map_offset(item["start"] + bias, offset_mapping)
end = map_offset(item["end"] - 1 + bias, offset_mapping)
start_ids[start] = 1.0
end_ids[end] = 1.0
if multilingual:
tokenized_output = {
"input_ids": encoded_inputs["input_ids"],
"position_ids": encoded_inputs["position_ids"],
"start_positions": start_ids,
"end_positions": end_ids,
}
else:
tokenized_output = {
"input_ids": encoded_inputs["input_ids"],
"token_type_ids": encoded_inputs["token_type_ids"],
"position_ids": encoded_inputs["position_ids"],
"attention_mask": encoded_inputs["attention_mask"],
"start_positions": start_ids,
"end_positions": end_ids,
}
return tokenized_output
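# Worked example of the offset bias adjustment in convert_example. Token
# offsets for the content segment restart at 0 after the prompt's [SEP], so
# each content span is shifted by the prompt length + 1. The mapping below is
# illustrative: [CLS], 2 prompt tokens, [SEP], 2 content tokens, [SEP].
demo_offsets = [[0, 0], [0, 2], [2, 4], [0, 0], [0, 3], [3, 5], [0, 0]]
demo_bias = 0
for i in range(1, len(demo_offsets)):
    if demo_offsets[i] == [0, 0] and demo_bias == 0:
        demo_bias = demo_offsets[i - 1][1] + 1  # first [SEP]: prompt length + 1
    if demo_offsets[i] == [0, 0]:
        continue
    demo_offsets[i][0] += demo_bias
    demo_offsets[i][1] += demo_bias
assert demo_offsets[4] == [5, 8]  # content span [0, 3] shifted by bias of 5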