Commit 10f294ff authored by yuguo-Jack
llama_paddle
parent 7c64e6ec
# UIE Slim data distillation
While UIE has powerful zero-shot extraction capabilities, its prompt-based structure makes real-time serving compute-intensive. Some industrial application scenarios have strict inference performance requirements, and the model cannot go into production without effective compression. UIE Slim Data Distillation is built on knowledge distillation: it uses data as a bridge to transfer the knowledge of the UIE model to a smaller closed-domain information extraction model, achieving a significant inference speedup with minimal loss in accuracy.
#### Three steps of UIE data distillation
- **Step 1**: Finetune the UIE model on the labeled data to get the Teacher Model.
- **Step 2**: Process the user-provided unlabeled data and run inference with Taskflow UIE.
- **Step 3**: Use the labeled data and the inference results obtained in step 2 to train a closed-domain Student Model.
## UIE Finetune
Refer to [UIE relationship extraction fine-tuning](../README.md) to complete the model fine-tuning and get ``../checkpoint/model_best``.
## Offline Distillation
#### Predict labels for unsupervised data with the trained UIE custom model
```shell
python data_distill.py \
--data_path ../data \
--save_dir student_data \
--task_type relation_extraction \
--synthetic_ratio 10 \
--model_path ../checkpoint/model_best
```
**NOTE**: The schema must be configured in `data_distill.py` according to the labeled data, and it must contain all label types present in that data.
Description of configurable parameters:
- `data_path`: path to the labeled data (`doccano_ext.json`) and the unsupervised text (`unlabeled_data.txt`).
- `model_path`: path of the trained UIE custom model.
- `save_dir`: directory where the student model's training data is saved.
- `synthetic_ratio`: controls the ratio of synthetic data; the maximum number of synthetic samples is `synthetic_ratio` × the number of labeled samples.
- `platform`: annotation platform used to label the data; options are `doccano` and `label_studio`, default `label_studio`.
- `task_type`: task type; options are `entity_extraction`, `relation_extraction`, `event_extraction`, and `opinion_extraction`. Because this is closed-domain extraction and the post-processing logic differs per task, the task type must be specified.
- `seed`: random seed, default 1000.
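The schema mentioned in the note above is defined directly inside `data_distill.py`; for the relation-extraction example used throughout this project it looks like the following (the Chinese keys are the label names from the annotated data: weapon name, with relations country of origin, type, and R&D unit):

```python
# Closed-domain schema: each subject label maps to its relation (predicate)
# names. It must cover every label type appearing in the labeled data.
schema = {"武器名称": ["产国", "类型", "研发单位"]}
```

Edit this dictionary to match your own label set before running `data_distill.py`.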
#### Teacher model evaluation
In the UIE fine-tuning stage, model performance is evaluated on data in the UIE training format, which is not a standard end-to-end evaluation for relation extraction or event extraction. End-to-end evaluation can be performed with the following script.
```shell
python evaluate_teacher.py \
--task_type relation_extraction \
--test_path ./student_data/dev_data.json \
--label_maps_path ./student_data/label_maps.json \
--model_path ../checkpoint/model_best
```
Description of configurable parameters:
- `model_path`: path of the trained UIE custom model.
- `test_path`: path of the test set.
- `label_maps_path`: dictionary of student model labels.
- `batch_size`: batch size, default 8.
- `max_seq_len`: maximum text length, default 256.
- `task_type`: task type; options are `entity_extraction`, `relation_extraction`, `event_extraction`, and `opinion_extraction`. Because this evaluates closed-domain information extraction, the task type must be specified.
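The end-to-end scores reported here are exact-match precision/recall/F1 over predicted spans, computed by `metric.py` in this project. A simplified sketch of that span-set comparison (the helper name is illustrative, not the project API):

```python
def span_f1(pred_spans, gold_spans):
    """Exact-match precision/recall/F1 over two collections of spans,
    e.g. (start, end, label) tuples. Illustrative only."""
    pred, gold = set(pred_spans), set(gold_spans)
    tp = len(pred & gold)  # spans predicted exactly right
    precision = tp / len(pred) if pred else 0.0
    recall = tp / len(gold) if gold else 0.0
    # F1 = 2*TP / (|pred| + |gold|), equivalent to the harmonic mean of P and R
    f1 = 2 * tp / (len(pred) + len(gold)) if (pred or gold) else 0.0
    return precision, recall, f1
```

A prediction only counts if the start offset, end offset, and label all match the gold annotation, which is why this evaluation is stricter than the UIE training-format evaluation above.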
#### Student model training
```shell
python train.py \
--task_type relation_extraction \
--train_path student_data/train_data.json \
--dev_path student_data/dev_data.json \
--label_maps_path student_data/label_maps.json \
--num_epochs 50 \
--encoder ernie-3.0-mini-zh
```
Description of configurable parameters:
- `train_path`: training set file path.
- `dev_path`: validation set file path.
- `batch_size`: batch size, default 16.
- `learning_rate`: learning rate, default 3e-5.
- `save_dir`: model storage path, default `./checkpoint`.
- `max_seq_len`: maximum text length, default 256.
- `weight_decay`: weight decay coefficient used by the AdamW optimizer.
- `warmup_proportion`: proportion of training used for learning-rate warmup. If set to 0.1, the learning rate increases linearly from 0 to `learning_rate` over the first 10% of training steps, then decays linearly. Default 0.0.
- `num_epochs`: number of training epochs, default 100.
- `seed`: random seed, default 1000.
- `encoder`: backbone model of the student model, default `ernie-3.0-mini-zh`.
- `task_type`: task type; options are `entity_extraction`, `relation_extraction`, `event_extraction`, and `opinion_extraction`. Because this is closed-domain information extraction, the task type must be specified.
- `logging_steps`: interval (in steps) between log prints, default 10.
- `eval_steps`: interval (in steps) between evaluations, default 200.
- `device`: device to train on; options are `cpu` and `gpu`.
- `init_from_ckpt`: optional; model parameter path for warm-starting training. Default None.
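The warmup-then-linear-decay schedule described for `warmup_proportion` can be sketched as follows; this is a simplified stand-in for PaddleNLP's `LinearDecayWithWarmup` used by `train.py`, for intuition only:

```python
def lr_at_step(step, total_steps, learning_rate=3e-5, warmup_proportion=0.1):
    """Learning rate at a given step: linear warmup from 0 over the first
    warmup_proportion of training, then linear decay back to 0."""
    warmup_steps = int(total_steps * warmup_proportion)
    if warmup_steps > 0 and step < warmup_steps:
        # Warmup phase: ramp up proportionally to progress through warmup
        return learning_rate * step / warmup_steps
    # Decay phase: fall linearly to 0 at the final step
    return learning_rate * max(0.0, (total_steps - step) / max(1, total_steps - warmup_steps))
```

With `warmup_proportion=0.1` and 1000 total steps, the rate peaks at `learning_rate` at step 100 and reaches 0 at step 1000.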
#### Student model evaluation
```shell
python evaluate.py \
--model_path ./checkpoint/model_best \
--test_path student_data/dev_data.json \
--task_type relation_extraction \
--label_maps_path student_data/label_maps.json \
--encoder ernie-3.0-mini-zh
```
Description of configurable parameters:
- `model_path`: path of the trained student model.
- `test_path`: path of the test set.
- `label_maps_path`: dictionary of student model labels.
- `batch_size`: batch size, default 8.
- `max_seq_len`: maximum text length, default 256.
- `encoder`: backbone model of the student model, default `ernie-3.0-mini-zh`.
- `task_type`: task type; options are `entity_extraction`, `relation_extraction`, `event_extraction`, and `opinion_extraction`. Because this evaluates closed-domain information extraction, the task type must be specified.
## Student model deployment
- Quickly deploy the closed-domain information extraction model through Taskflow; `task_path` is the path of the student model.
```python
>>> from pprint import pprint
>>> from paddlenlp import Taskflow
>>> my_ie = Taskflow("information_extraction", model="uie-data-distill-gp", task_path="checkpoint/model_best/") # Schema is fixed in closed-domain information extraction
>>> pprint(my_ie("The Virgo deceleration bomb was developed by the Swedish FFV Ordnance Company specifically for low-altitude, high-speed bombing by attack aircraft of the Royal Swedish Air Force. Development began in 1956 and it entered service in 1963. It is equipped on the A32 'Contradiction', A35 'Dragon', and AJ134 'Thunder' attack aircraft, and is mainly used to attack landing craft, parked aircraft, anti-aircraft artillery, field artillery, light armored vehicles, and active forces."))
[{'weapon name': [{'end': 14,
                   'probability': 0.9976037,
                   'relations': {'country of origin': [{'end': 18,
                                                        'probability': 0.9988706,
                                                        'relations': {},
                                                        'start': 16,
                                                        'text': 'Sweden'}],
                                 'R&D unit': [{'end': 25,
                                               'probability': 0.9978277,
                                               'relations': {},
                                               'start': 18,
                                               'text': 'FFV Ordnance Company'}],
                                 'type': [{'end': 14,
                                           'probability': 0.99837446,
                                           'relations': {},
                                           'start': 12,
                                           'text': 'bomb'}]},
                   'start': 0,
                   'text': 'Virgo deceleration bomb'}]}]
```
# References
- **[GlobalPointer](https://kexue.fm/search/globalpointer/)**
- **[GPLinker](https://kexue.fm/archives/8888)**
- **[JunnYu/GPLinker_pytorch](https://github.com/JunnYu/GPLinker_pytorch)**
# coding=utf-8
# Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
import numpy as np
import paddle
import paddle.nn as nn
class Criterion(nn.Layer):
    """Criterion for GPNet"""

    def __init__(self, mask_zero=True):
        super().__init__()
        self.mask_zero = mask_zero

    def _sparse_multilabel_categorical_crossentropy(self, y_true, y_pred, mask_zero=False):
        """Sparse multi-label categorical cross entropy,
        see "https://kexue.fm/archives/7359".
        """
        zeros = paddle.zeros_like(y_pred[..., :1])
        y_pred = paddle.concat([y_pred, zeros], axis=-1)
        if mask_zero:
            infs = zeros + 1e12
            y_pred = paddle.concat([infs, y_pred[..., 1:]], axis=-1)
        y_pos_2 = paddle.take_along_axis(y_pred, y_true, axis=-1)
        y_pos_1 = paddle.concat([y_pos_2, zeros], axis=-1)
        if mask_zero:
            y_pred = paddle.concat([-infs, y_pred[..., 1:]], axis=-1)
            y_pos_2 = paddle.take_along_axis(y_pred, y_true, axis=-1)
        pos_loss = (-y_pos_1).exp().sum(axis=-1).log()
        all_loss = y_pred.exp().sum(axis=-1).log()
        aux_loss = y_pos_2.exp().sum(axis=-1).log() - all_loss
        aux_loss = paddle.clip(1 - paddle.exp(aux_loss), min=0.1, max=1)
        neg_loss = all_loss + paddle.log(aux_loss)
        return pos_loss + neg_loss

    def __call__(self, y_pred, y_true):
        shape = y_pred.shape
        y_true = y_true[..., 0] * shape[2] + y_true[..., 1]
        # Reshape to [batch_size, num_classes, seq_len * seq_len]
        y_pred = paddle.reshape(y_pred, shape=[shape[0], -1, np.prod(shape[2:])])
        loss = self._sparse_multilabel_categorical_crossentropy(y_true, y_pred, self.mask_zero)
        return loss.sum(axis=1).mean()
# coding=utf-8
# Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
from dataclasses import dataclass
from typing import Dict, List, Optional, Union
import paddle
from paddlenlp.transformers.tokenizer_utils_base import (
    PaddingStrategy,
    PretrainedTokenizerBase,
)
ignore_list = ["offset_mapping", "text"]
@dataclass
class DataCollator:
    tokenizer: PretrainedTokenizerBase
    padding: Union[bool, str, PaddingStrategy] = True
    max_length: Optional[int] = None
    label_maps: Optional[dict] = None
    task_type: Optional[str] = None

    def __call__(self, features: List[Dict[str, Union[List[int], paddle.Tensor]]]) -> Dict[str, paddle.Tensor]:
        labels = [feature["labels"] for feature in features] if "labels" in features[0].keys() else None
        new_features = [{k: v for k, v in f.items() if k not in ["labels"] + ignore_list} for f in features]
        batch = self.tokenizer.pad(
            new_features,
            padding=self.padding,
        )
        batch = [paddle.to_tensor(batch[k]) for k in batch.keys()]
        if labels is None:  # for test
            if "offset_mapping" in features[0].keys():
                batch.append([feature["offset_mapping"] for feature in features])
            if "text" in features[0].keys():
                batch.append([feature["text"] for feature in features])
            return batch
        bs = batch[0].shape[0]
        if self.task_type == "entity_extraction":
            # Ensure the dimension is greater than or equal to 1
            max_ent_num = max(max([len(lb["ent_labels"]) for lb in labels]), 1)
            num_ents = len(self.label_maps["entity2id"])
            batch_entity_labels = paddle.zeros(shape=[bs, num_ents, max_ent_num, 2], dtype="int64")
            for i, lb in enumerate(labels):
                for eidx, (l, eh, et) in enumerate(lb["ent_labels"]):
                    batch_entity_labels[i, l, eidx, :] = paddle.to_tensor([eh, et])
            batch.append([batch_entity_labels])
        else:
            # Ensure the dimension is greater than or equal to 1
            max_ent_num = max(max([len(lb["ent_labels"]) for lb in labels]), 1)
            max_spo_num = max(max([len(lb["rel_labels"]) for lb in labels]), 1)
            num_ents = len(self.label_maps["entity2id"])
            if "relation2id" in self.label_maps.keys():
                num_rels = len(self.label_maps["relation2id"])
            else:
                num_rels = len(self.label_maps["sentiment2id"])
            batch_entity_labels = paddle.zeros(shape=[bs, num_ents, max_ent_num, 2], dtype="int64")
            batch_head_labels = paddle.zeros(shape=[bs, num_rels, max_spo_num, 2], dtype="int64")
            batch_tail_labels = paddle.zeros(shape=[bs, num_rels, max_spo_num, 2], dtype="int64")
            for i, lb in enumerate(labels):
                for eidx, (l, eh, et) in enumerate(lb["ent_labels"]):
                    batch_entity_labels[i, l, eidx, :] = paddle.to_tensor([eh, et])
                for spidx, (sh, st, p, oh, ot) in enumerate(lb["rel_labels"]):
                    batch_head_labels[i, p, spidx, :] = paddle.to_tensor([sh, oh])
                    batch_tail_labels[i, p, spidx, :] = paddle.to_tensor([st, ot])
            batch.append([batch_entity_labels, batch_head_labels, batch_tail_labels])
        return batch
# coding=utf-8
# Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
import argparse
import json
import math
import os
import random
from tqdm import tqdm
from utils import anno2distill, schema2label_maps, set_seed, synthetic2distill
from paddlenlp import Taskflow
from paddlenlp.utils.log import logger
def do_data_distill():
    set_seed(args.seed)
    # Generate closed-domain label maps
    if not os.path.exists(args.save_dir):
        os.mkdir(args.save_dir)
    label_maps = schema2label_maps(args.task_type, schema=args.schema)
    label_maps_path = os.path.join(args.save_dir, "label_maps.json")
    # Save closed-domain label maps file
    with open(label_maps_path, "w", encoding="utf-8") as fp:
        fp.write(json.dumps(label_maps, ensure_ascii=False))
    # Load annotation file and convert to distill format
    sample_index = json.loads(
        open(os.path.join(args.data_path, "sample_index.json"), "r", encoding="utf-8").readline()
    )
    train_ids = sample_index["train_ids"]
    dev_ids = sample_index["dev_ids"]
    test_ids = sample_index["test_ids"]
    if args.platform == "label_studio":
        with open(os.path.join(args.data_path, "label_studio.json"), "r", encoding="utf-8") as fp:
            json_lines = json.loads(fp.read())
    elif args.platform == "doccano":
        json_lines = []
        with open(os.path.join(args.data_path, "doccano_ext.json"), "r", encoding="utf-8") as fp:
            for line in fp:
                json_lines.append(json.loads(line))
    else:
        raise ValueError("Unsupported annotation platform!")
    train_lines = [json_lines[i] for i in train_ids]
    train_lines = anno2distill(train_lines, args.task_type, label_maps, args.platform)
    dev_lines = [json_lines[i] for i in dev_ids]
    dev_lines = anno2distill(dev_lines, args.task_type, label_maps, args.platform)
    test_lines = [json_lines[i] for i in test_ids]
    test_lines = anno2distill(test_lines, args.task_type, label_maps, args.platform)
    # Load trained UIE model
    uie = Taskflow("information_extraction", schema=args.schema, task_path=args.model_path)
    if args.synthetic_ratio > 0:
        # Generate synthetic data
        texts = open(os.path.join(args.data_path, "unlabeled_data.txt"), "r", encoding="utf-8").readlines()
        actual_ratio = math.ceil(len(texts) / len(train_lines))
        if actual_ratio <= args.synthetic_ratio or args.synthetic_ratio == -1:
            infer_texts = texts
        else:
            idxs = random.sample(range(0, len(texts)), args.synthetic_ratio * len(train_lines))
            infer_texts = [texts[i] for i in idxs]
        infer_results = []
        for text in tqdm(infer_texts, desc="Predicting: ", leave=False):
            infer_results.extend(uie(text))
        train_synthetic_lines = synthetic2distill(infer_texts, infer_results, args.task_type)
        # Concat original and synthetic data
        train_lines.extend(train_synthetic_lines)

    def _save_examples(save_dir, file_name, examples):
        count = 0
        save_path = os.path.join(save_dir, file_name)
        with open(save_path, "w", encoding="utf-8") as f:
            for example in examples:
                f.write(json.dumps(example, ensure_ascii=False) + "\n")
                count += 1
        logger.info("Save %d examples to %s." % (count, save_path))

    _save_examples(args.save_dir, "train_data.json", train_lines)
    _save_examples(args.save_dir, "dev_data.json", dev_lines)
    _save_examples(args.save_dir, "test_data.json", test_lines)


if __name__ == "__main__":
    # yapf: disable
    parser = argparse.ArgumentParser()
    parser.add_argument("--data_path", default="../data", type=str, help="The directory for labeled data with doccano format and the large scale unlabeled data.")
    parser.add_argument("--model_path", type=str, default="../checkpoint/model_best", help="The path of saved model that you want to load.")
    parser.add_argument("--save_dir", default="./distill_task", type=str, help="The path of data that you wanna save.")
    parser.add_argument("--synthetic_ratio", default=10, type=int, help="The ratio of labeled and synthetic samples.")
    parser.add_argument("--task_type", choices=['relation_extraction', 'event_extraction', 'entity_extraction', 'opinion_extraction'], default="entity_extraction", type=str, help="Select the training task type.")
    parser.add_argument("--seed", type=int, default=1000, help="Random seed for initialization")
    parser.add_argument("--platform", choices=['doccano', 'label_studio'], type=str, default="label_studio", help="Select the annotation platform.")
    args = parser.parse_args()
    # yapf: enable

    # Define your schema here
    schema = {"武器名称": ["产国", "类型", "研发单位"]}
    args.schema = schema

    do_data_distill()
# Service deployment based on PaddleNLP SimpleServing
## Contents
- [Environment Preparation](#environment-preparation)
- [Starting the Server](#starting-the-server)
- [Starting the Client](#starting-the-client)
- [Service Custom Parameters](#service-custom-parameters)
## Environment Preparation
Install a PaddleNLP version with SimpleServing support (or the latest develop version):
```shell
pip install "paddlenlp>=2.4.4"
```
## Starting the Server
```bash
paddlenlp server server:app --workers 1 --host 0.0.0.0 --port 8189
```
## Starting the Client
```bash
python client.py
```
## Service Custom Parameters
### Server Custom Parameters
#### Schema replacement
```python
# Default schema
schema = {"武器名称": ["产国", "类型", "研发单位"]}
```
#### Setting the model path
```python
# Default task_path
uie = Taskflow('information_extraction', model='uie-data-distill-gp', task_path='../../checkpoint/model_best/', schema=schema)
```
#### Multi-GPU service prediction
PaddleNLP SimpleServing supports multi-GPU load-balanced prediction: simply register two Taskflow tasks when registering the service. Sample code:
```python
uie1 = Taskflow('information_extraction', model='uie-data-distill-gp', task_path='../../checkpoint/model_best/', schema=schema, device_id=0)
uie2 = Taskflow('information_extraction', model='uie-data-distill-gp', task_path='../../checkpoint/model_best/', schema=schema, device_id=1)
service.register_taskflow('uie', [uie1, uie2])
```
### Client Custom Parameters
```python
# Change to the input texts you want
texts = ['威尔哥(Virgo)减速炸弹是由瑞典FFV军械公司专门为瑞典皇家空军的攻击机实施低空高速轰炸而研制,1956年开始研制,1963年进入服役,装备于A32“矛盾”、A35“龙”、和AJ134“雷”攻击机,主要用于攻击登陆艇、停放的飞机、高炮、野战火炮、轻型防护装甲车辆以及有生力量。']
```
# Service deployment based on PaddleNLP SimpleServing
- [Environment Preparation](#1)
- [Server](#2)
- [Client](#3)
- [Service Custom Parameters](#4)
<a name="1"></a>
## Environment Preparation
Install a PaddleNLP version with SimpleServing support (or the latest develop version):
```shell
pip install "paddlenlp>=2.4.4"
```
<a name="2"></a>
## Server
```bash
paddlenlp server server:app --workers 1 --host 0.0.0.0 --port 8189
```
<a name="3"></a>
## Client
```bash
python client.py
```
<a name="4"></a>
## Service Custom Parameters
### Server Custom Parameters
#### schema replacement
```python
# Default schema
schema = {"Weapon Name": ["Country of Production", "Type", "R&D Unit"]}
```
#### Set model path
```python
# Default task_path
uie = Taskflow('information_extraction', model='uie-data-distill-gp', task_path='../../checkpoint/model_best/', schema=schema)
```
#### Multi-GPU service prediction
PaddleNLP SimpleServing supports multi-GPU load-balanced prediction: simply register two Taskflow tasks when registering the service. Sample code:
```python
uie1 = Taskflow('information_extraction', task_path='../../checkpoint/model_best/', schema=schema, device_id=0)
uie2 = Taskflow('information_extraction', task_path='../../checkpoint/model_best/', schema=schema, device_id=1)
service.register_taskflow('uie', [uie1, uie2])
```
### Client Custom Parameters
```python
# Change to the input texts you want
texts = ['威尔哥(Virgo)减速炸弹是由瑞典FFV军械公司专门为瑞典皇家空军的攻击机实施低空高速轰炸而研制,1956年开始研制,1963年进入服役,装备于A32“矛盾”、A35“龙”、和AJ134“雷”攻击机,主要用于攻击登陆艇、停放的飞机、高炮、野战火炮、轻型防护装甲车辆以及有生力量。']
```
# Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
import json
import requests
url = "http://0.0.0.0:8189/taskflow/uie"
headers = {"Content-Type": "application/json"}
texts = [
    "威尔哥(Virgo)减速炸弹是由瑞典FFV军械公司专门为瑞典皇家空军的攻击机实施低空高速轰炸而研制,1956年开始研制,1963年进入服役,装备于A32“矛盾”、A35“龙”、和AJ134“雷”攻击机,主要用于攻击登陆艇、停放的飞机、高炮、野战火炮、轻型防护装甲车辆以及有生力量。"
]
data = {"data": {"text": texts}}
r = requests.post(url=url, headers=headers, data=json.dumps(data))
datas = json.loads(r.text)
print(datas)
# Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
from paddlenlp import SimpleServer, Taskflow
# The schema changed to your defined schema
schema = {"武器名称": ["产国", "类型", "研发单位"]}
# The task path changed to your best model path
uie = Taskflow(
    "information_extraction", model="uie-data-distill-gp", schema=schema, task_path="../../checkpoint/model_best/"
)
# If you want to define the finetuned uie service
app = SimpleServer()
app.register_taskflow("taskflow/uie", uie)
# coding=utf-8
# Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
import argparse
import os
import paddle
from metric import get_eval
from tqdm import tqdm
from utils import create_dataloader, get_label_maps, postprocess, reader
from paddlenlp.datasets import load_dataset
from paddlenlp.layers import (
    GlobalPointerForEntityExtraction,
    GPLinkerForRelationExtraction,
)
from paddlenlp.transformers import AutoModel, AutoTokenizer
from paddlenlp.utils.log import logger
@paddle.no_grad()
def evaluate(model, dataloader, label_maps, task_type="relation_extraction"):
    model.eval()
    all_preds = ([], []) if task_type in ["opinion_extraction", "relation_extraction", "event_extraction"] else []
    for batch in tqdm(dataloader, desc="Evaluating: ", leave=False):
        input_ids, attention_masks, offset_mappings, texts = batch
        logits = model(input_ids, attention_masks)
        batch_outputs = postprocess(logits, offset_mappings, texts, label_maps, task_type)
        if isinstance(batch_outputs, tuple):
            all_preds[0].extend(batch_outputs[0])  # Entity output
            all_preds[1].extend(batch_outputs[1])  # Relation output
        else:
            all_preds.extend(batch_outputs)
    eval_results = get_eval(all_preds, dataloader.dataset.raw_data, task_type)
    model.train()
    return eval_results


def do_eval():
    label_maps = get_label_maps(args.task_type, args.label_maps_path)
    tokenizer = AutoTokenizer.from_pretrained(args.encoder)
    encoder = AutoModel.from_pretrained(args.encoder)
    if args.task_type == "entity_extraction":
        model = GlobalPointerForEntityExtraction(encoder, label_maps)
    else:
        model = GPLinkerForRelationExtraction(encoder, label_maps)
    if args.model_path:
        state_dict = paddle.load(os.path.join(args.model_path, "model_state.pdparams"))
        model.set_dict(state_dict)
    test_ds = load_dataset(reader, data_path=args.test_path, lazy=False)
    test_dataloader = create_dataloader(
        test_ds,
        tokenizer,
        max_seq_len=args.max_seq_len,
        batch_size=args.batch_size,
        label_maps=label_maps,
        mode="test",
        task_type=args.task_type,
    )
    eval_result = evaluate(model, test_dataloader, label_maps, task_type=args.task_type)
    logger.info("Evaluation precision: " + str(eval_result))


if __name__ == "__main__":
    # yapf: disable
    parser = argparse.ArgumentParser()
    parser.add_argument("--model_path", type=str, default=None, help="The path of saved model that you want to load.")
    parser.add_argument("--test_path", type=str, default=None, help="The path of test set.")
    parser.add_argument("--encoder", default="ernie-3.0-mini-zh", type=str, help="Select the pretrained encoder model for GP.")
    parser.add_argument("--label_maps_path", default="./ner_data/label_maps.json", type=str, help="The file path of the labels dictionary.")
    parser.add_argument("--batch_size", type=int, default=16, help="Batch size per GPU/CPU for training.")
    parser.add_argument("--max_seq_len", type=int, default=128, help="The maximum total input sequence length after tokenization.")
    parser.add_argument("--task_type", choices=['relation_extraction', 'event_extraction', 'entity_extraction', 'opinion_extraction'], default="entity_extraction", type=str, help="Select the training task type.")
    args = parser.parse_args()
    # yapf: enable

    do_eval()
# coding=utf-8
# Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
import argparse
import paddle
from metric import get_eval
from tqdm import tqdm
from utils import create_dataloader, get_label_maps, reader, synthetic2distill
from paddlenlp import Taskflow
from paddlenlp.datasets import load_dataset
from paddlenlp.transformers import AutoTokenizer
from paddlenlp.utils.log import logger
@paddle.no_grad()
def evaluate(uie, dataloader, task_type="relation_extraction"):
    all_preds = ([], []) if task_type in ["opinion_extraction", "relation_extraction", "event_extraction"] else []
    infer_results = []
    all_texts = []
    for batch in tqdm(dataloader, desc="Evaluating: ", leave=False):
        _, _, _, texts = batch
        all_texts.extend(texts)
        infer_results.extend(uie(texts))
    infer_results = synthetic2distill(all_texts, infer_results, task_type)
    for res in infer_results:
        if task_type == "entity_extraction":
            all_preds.append(res["entity_list"])
        else:
            all_preds[0].append(res["entity_list"])
            all_preds[1].append(res["spo_list"])
    eval_results = get_eval(all_preds, dataloader.dataset.raw_data, task_type)
    return eval_results


def do_eval():
    # Load trained UIE model
    uie = Taskflow("information_extraction", schema=args.schema, batch_size=args.batch_size, task_path=args.model_path)
    label_maps = get_label_maps(args.task_type, args.label_maps_path)
    tokenizer = AutoTokenizer.from_pretrained("ernie-3.0-base-zh")
    test_ds = load_dataset(reader, data_path=args.test_path, lazy=False)
    test_dataloader = create_dataloader(
        test_ds,
        tokenizer,
        max_seq_len=args.max_seq_len,
        batch_size=args.batch_size,
        label_maps=label_maps,
        mode="test",
        task_type=args.task_type,
    )
    eval_result = evaluate(uie, test_dataloader, task_type=args.task_type)
    logger.info("Evaluation precision: " + str(eval_result))


if __name__ == "__main__":
    # yapf: disable
    parser = argparse.ArgumentParser()
    parser.add_argument("--model_path", type=str, default=None, help="The path of saved model that you want to load.")
    parser.add_argument("--test_path", type=str, default=None, help="The path of test set.")
    parser.add_argument("--label_maps_path", default="./ner_data/label_maps.json", type=str, help="The file path of the labels dictionary.")
    parser.add_argument("--batch_size", type=int, default=8, help="Batch size per GPU/CPU for training.")
    parser.add_argument("--max_seq_len", type=int, default=256, help="The maximum total input sequence length after tokenization.")
    parser.add_argument("--task_type", choices=['relation_extraction', 'event_extraction', 'entity_extraction', 'opinion_extraction'], default="entity_extraction", type=str, help="Select the training task type.")
    args = parser.parse_args()
    # yapf: enable

    schema = {"武器名称": ["产国", "类型", "研发单位"]}
    args.schema = schema

    do_eval()
# coding=utf-8
# Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
def get_eval(all_preds, raw_data, task_type):
if task_type == "entity_extraction":
ex, ey, ez = 1e-10, 1e-10, 1e-10
for ent_preds, data in zip(all_preds, raw_data):
pred_ent_set = set([tuple(p.values()) for p in ent_preds])
gold_ent_set = set([tuple(g.values()) for g in data["entity_list"]])
ex += len(pred_ent_set & gold_ent_set)
ey += len(pred_ent_set)
ez += len(gold_ent_set)
ent_f1 = round(2 * ex / (ey + ez), 5) if ex != 1e-10 else 0.0
ent_precision = round(ex / ey, 5) if ey != 1e-10 else 0.0
ent_recall = round(ex / ez, 5) if ez != 1e-10 else 0.0
return {
"entity_f1": ent_f1,
"entity_precision": ent_precision,
"entity_recall": ent_recall,
}
else:
all_ent_preds, all_rel_preds = all_preds
ex, ey, ez = 1e-10, 1e-10, 1e-10
for ent_preds, data in zip(all_ent_preds, raw_data):
pred_ent_set = set([tuple(p.values()) for p in ent_preds])
gold_ent_set = set([tuple(g.values()) for g in data["entity_list"]])
ex += len(pred_ent_set & gold_ent_set)
ey += len(pred_ent_set)
ez += len(gold_ent_set)
ent_f1 = round(2 * ex / (ey + ez), 5) if ex != 1e-10 else 0.0
ent_precision = round(ex / ey, 5) if ey != 1e-10 else 0.0
ent_recall = round(ex / ez, 5) if ez != 1e-10 else 0.0
rx, ry, rz = 1e-10, 1e-10, 1e-10
for rel_preds, raw_data in zip(all_rel_preds, raw_data):
pred_rel_set = set([tuple(p.values()) for p in rel_preds])
if task_type == "opinion_extraction":
gold_rel_set = set([tuple(g.values()) for g in raw_data["aso_list"]])
else:
gold_rel_set = set([tuple(g.values()) for g in raw_data["spo_list"]])
rx += len(pred_rel_set & gold_rel_set)
ry += len(pred_rel_set)
rz += len(gold_rel_set)
rel_f1 = round(2 * rx / (ry + rz), 5) if rx != 1e-10 else 0.0
rel_precision = round(rx / ry, 5) if ry != 1e-10 else 0.0
rel_recall = round(rx / rz, 5) if rz != 1e-10 else 0.0
return {
"entity_f1": ent_f1,
"entity_precision": ent_precision,
"entity_recall": ent_recall,
"relation_f1": rel_f1,
"relation_precision": rel_precision,
"relation_recall": rel_recall,
}
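The set-intersection micro-F1 used in `get_eval` can be isolated into a small helper. A minimal sketch of the same computation; the two toy "sentences" below are fabricated examples, not data from this project:

```python
def micro_prf(pred_sets, gold_sets):
    # Micro-averaged precision/recall/F1 over per-sentence prediction sets,
    # with the same 1e-10 smoothing get_eval uses to avoid division by zero.
    x, y, z = 1e-10, 1e-10, 1e-10  # correct, predicted, gold counts
    for pred, gold in zip(pred_sets, gold_sets):
        x += len(pred & gold)
        y += len(pred)
        z += len(gold)
    f1 = round(2 * x / (y + z), 5) if x != 1e-10 else 0.0
    precision = round(x / y, 5) if y != 1e-10 else 0.0
    recall = round(x / z, 5) if z != 1e-10 else 0.0
    return precision, recall, f1

# Two toy "sentences"; entities are (text, type, start_index) tuples.
toy_preds = [{("张三", "人物", 0)}, {("北京", "地点", 5), ("李四", "人物", 0)}]
toy_gold = [{("张三", "人物", 0)}, {("北京", "地点", 5)}]
p, r, f1 = micro_prf(toy_preds, toy_gold)
```

With 2 of 3 predictions correct and all 2 gold entities found, precision is 2/3 and recall is 1.0.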
# coding=utf-8
# Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
import argparse
import os
import time
import paddle
from criterion import Criterion
from evaluate import evaluate
from utils import (
create_dataloader,
criteria_map,
get_label_maps,
reader,
save_model_config,
set_seed,
)
from paddlenlp.datasets import load_dataset
from paddlenlp.layers import (
GlobalPointerForEntityExtraction,
GPLinkerForRelationExtraction,
)
from paddlenlp.transformers import AutoModel, AutoTokenizer, LinearDecayWithWarmup
from paddlenlp.utils.log import logger
def do_train():
paddle.set_device(args.device)
rank = paddle.distributed.get_rank()
if paddle.distributed.get_world_size() > 1:
paddle.distributed.init_parallel_env()
set_seed(args.seed)
label_maps = get_label_maps(args.task_type, args.label_maps_path)
train_ds = load_dataset(reader, data_path=args.train_path, lazy=False)
dev_ds = load_dataset(reader, data_path=args.dev_path, lazy=False)
tokenizer = AutoTokenizer.from_pretrained(args.encoder)
train_dataloader = create_dataloader(
train_ds,
tokenizer,
max_seq_len=args.max_seq_len,
batch_size=args.batch_size,
label_maps=label_maps,
mode="train",
task_type=args.task_type,
)
dev_dataloader = create_dataloader(
dev_ds,
tokenizer,
max_seq_len=args.max_seq_len,
batch_size=args.batch_size,
label_maps=label_maps,
mode="dev",
task_type=args.task_type,
)
encoder = AutoModel.from_pretrained(args.encoder)
if args.task_type == "entity_extraction":
model = GlobalPointerForEntityExtraction(encoder, label_maps)
else:
model = GPLinkerForRelationExtraction(encoder, label_maps)
model_config = {"task_type": args.task_type, "label_maps": label_maps, "encoder": args.encoder}
num_training_steps = len(train_dataloader) * args.num_epochs
lr_scheduler = LinearDecayWithWarmup(args.learning_rate, num_training_steps, args.warmup_proportion)
# Generate parameter names needed to perform weight decay.
# All bias and LayerNorm parameters are excluded.
decay_params = [p.name for n, p in model.named_parameters() if not any(nd in n for nd in ["bias", "norm"])]
optimizer = paddle.optimizer.AdamW(
learning_rate=lr_scheduler,
parameters=model.parameters(),
weight_decay=args.weight_decay,
apply_decay_param_fun=lambda x: x in decay_params,
)
if args.init_from_ckpt and os.path.isfile(args.init_from_ckpt):
state_dict = paddle.load(args.init_from_ckpt)
model.set_dict(state_dict)
if paddle.distributed.get_world_size() > 1:
model = paddle.DataParallel(model)
criterion = Criterion()
global_step, best_f1 = 1, 0.0
tr_loss, logging_loss = 0.0, 0.0
tic_train = time.time()
for epoch in range(1, args.num_epochs + 1):
for batch in train_dataloader:
input_ids, attention_masks, labels = batch
logits = model(input_ids, attention_masks)
loss = sum([criterion(o, l) for o, l in zip(logits, labels)]) / 3
loss.backward()
tr_loss += loss.item()
lr_scheduler.step()
optimizer.step()
optimizer.clear_grad()
if global_step % args.logging_steps == 0 and rank == 0:
time_diff = time.time() - tic_train
loss_avg = (tr_loss - logging_loss) / args.logging_steps
logger.info(
"global step %d, epoch: %d, loss: %.5f, speed: %.2f step/s"
% (global_step, epoch, loss_avg, args.logging_steps / time_diff)
)
logging_loss = tr_loss
tic_train = time.time()
if global_step % args.eval_steps == 0 and rank == 0:
save_dir = os.path.join(args.save_dir, "model_%d" % global_step)
if not os.path.exists(save_dir):
os.makedirs(save_dir)
save_param_path = os.path.join(save_dir, "model_state.pdparams")
paddle.save(model.state_dict(), save_param_path)
save_model_config(save_dir, model_config)
logger.disable()
tokenizer.save_pretrained(save_dir)
logger.enable()
eval_result = evaluate(model, dev_dataloader, label_maps, task_type=args.task_type)
                logger.info("Evaluation results: " + str(eval_result))
f1 = eval_result[criteria_map[args.task_type]]
if f1 > best_f1:
logger.info(f"best F1 performance has been updated: {best_f1:.5f} --> {f1:.5f}")
best_f1 = f1
save_dir = os.path.join(args.save_dir, "model_best")
if not os.path.exists(save_dir):
os.makedirs(save_dir)
save_param_path = os.path.join(save_dir, "model_state.pdparams")
paddle.save(model.state_dict(), save_param_path)
save_model_config(save_dir, model_config)
logger.disable()
tokenizer.save_pretrained(save_dir)
logger.enable()
tic_train = time.time()
global_step += 1
if __name__ == "__main__":
# yapf: disable
parser = argparse.ArgumentParser()
parser.add_argument("--train_path", default=None, type=str, help="The path of train set.")
parser.add_argument("--dev_path", default=None, type=str, help="The path of dev set.")
parser.add_argument("--batch_size", default=16, type=int, help="Batch size per GPU/CPU for training.")
parser.add_argument("--learning_rate", default=3e-5, type=float, help="The initial learning rate for Adam.")
parser.add_argument("--save_dir", default='./checkpoint', type=str, help="The output directory where the model checkpoints will be written.")
parser.add_argument("--max_seq_len", default=256, type=int, help="The maximum input sequence length.")
parser.add_argument("--label_maps_path", default="./ner_data/label_maps.json", type=str, help="The file path of the labels dictionary.")
parser.add_argument("--weight_decay", default=0.0, type=float, help="Weight decay rate for L2 regularizer.")
parser.add_argument("--warmup_proportion", default=0.0, type=float, help="Linear warmup proportion over the training process.")
    parser.add_argument("--num_epochs", default=100, type=int, help="Number of epochs for training.")
parser.add_argument("--seed", default=1000, type=int, help="Random seed for initialization")
parser.add_argument("--encoder", default="ernie-3.0-mini-zh", type=str, help="Select the pretrained encoder model for GP.")
parser.add_argument("--task_type", choices=['relation_extraction', 'event_extraction', 'entity_extraction', 'opinion_extraction'], default="entity_extraction", type=str, help="Select the training task type.")
    parser.add_argument("--logging_steps", default=10, type=int, help="The interval steps for logging.")
parser.add_argument("--eval_steps", default=200, type=int, help="The interval steps to evaluate model performance.")
    parser.add_argument('--device', choices=['cpu', 'gpu'], default="gpu", help="Select the device for training; defaults to gpu.")
parser.add_argument("--init_from_ckpt", default=None, type=str, help="The path of model parameters for initialization.")
args = parser.parse_args()
# yapf: enable
do_train()
# coding=utf-8
# Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
import copy
import json
import os
import random
import numpy as np
import paddle
from data_collator import DataCollator
from paddlenlp.taskflow.utils import SchemaTree
from paddlenlp.utils.log import logger
criteria_map = {
"entity_extraction": "entity_f1",
"opinion_extraction": "relation_f1", # (Aspect, Sentiment, Opinion)
"relation_extraction": "relation_f1", # (Subject, Predicate, Object)
"event_extraction": "relation_f1", # (Trigger, Role, Argument)
}
def set_seed(seed):
paddle.seed(seed)
random.seed(seed)
np.random.seed(seed)
def reader(data_path):
with open(data_path, "r", encoding="utf-8") as f:
for line in f:
json_line = json.loads(line)
yield json_line
def save_model_config(save_dir, model_config):
model_config_file = os.path.join(save_dir, "model_config.json")
with open(model_config_file, "w", encoding="utf-8") as fp:
fp.write(json.dumps(model_config, ensure_ascii=False, indent=2))
def map_offset(ori_offset, offset_mapping):
    """
    Map a character-level offset in the original text to a token index.
    """
for index, span in enumerate(offset_mapping):
if span[0] <= ori_offset < span[1]:
return index
return -1
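`map_offset` is the bridge between character-level annotation offsets and token indices. A quick self-contained sanity check, restating the function so the snippet runs on its own; the `offset_mapping` is fabricated in the shape produced by a tokenizer's `return_offsets_mapping=True`:

```python
def map_offset(ori_offset, offset_mapping):
    # Same logic as map_offset above: find the token whose character
    # span contains the original character offset.
    for index, span in enumerate(offset_mapping):
        if span[0] <= ori_offset < span[1]:
            return index
    return -1

# Made-up offset mapping: a (0, 0) special token, then tokens covering
# characters [0, 2), [2, 4) and [4, 7) of the raw text.
offset_mapping = [(0, 0), (0, 2), (2, 4), (4, 7)]
tok = map_offset(3, offset_mapping)      # char 3 lies in span (2, 4) -> token 2
missing = map_offset(9, offset_mapping)  # outside every span -> -1
```

Note that the `(0, 0)` special-token span can never contain an offset, so labels are never aligned onto `[CLS]`-like tokens.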
def get_label_maps(task_type="relation_extraction", label_maps_path=None):
with open(label_maps_path, "r", encoding="utf-8") as fp:
label_maps = json.load(fp)
if task_type == "entity_extraction":
entity2id = label_maps["entity2id"]
id2entity = {idx: t for t, idx in entity2id.items()}
label_maps["id2entity"] = id2entity
else:
entity2id = label_maps["entity2id"]
relation2id = (
label_maps["relation2id"]
if task_type in ["relation_extraction", "event_extraction"]
else label_maps["sentiment2id"]
)
id2entity = {idx: t for t, idx in entity2id.items()}
id2relation = {idx: t for t, idx in relation2id.items()}
label_maps["id2entity"] = id2entity
label_maps["id2relation"] = id2relation
return label_maps
def create_dataloader(
dataset, tokenizer, max_seq_len=128, batch_size=1, label_maps=None, mode="train", task_type="relation_extraction"
):
def tokenize_and_align_train_labels(example):
tokenized_inputs = tokenizer(
example["text"],
max_length=max_seq_len,
padding=False,
truncation=True,
return_attention_mask=True,
return_token_type_ids=False,
return_offsets_mapping=True,
)
offset_mapping = tokenized_inputs["offset_mapping"]
ent_labels = []
for e in example["entity_list"]:
_start, _end = e["start_index"], e["start_index"] + len(e["text"]) - 1
start = map_offset(_start, offset_mapping)
end = map_offset(_end, offset_mapping)
if start == -1 or end == -1:
continue
label = label_maps["entity2id"][e["type"]]
ent_labels.append([label, start, end])
outputs = {
"input_ids": tokenized_inputs["input_ids"],
"attention_mask": tokenized_inputs["attention_mask"],
"labels": {"ent_labels": ent_labels, "rel_labels": []},
}
if task_type in ["relation_extraction", "event_extraction"]:
rel_labels = []
for r in example["spo_list"]:
_sh, _oh = r["subject_start_index"], r["object_start_index"]
_st, _ot = _sh + len(r["subject"]) - 1, _oh + len(r["object"]) - 1
sh = map_offset(_sh, offset_mapping)
st = map_offset(_st, offset_mapping)
oh = map_offset(_oh, offset_mapping)
ot = map_offset(_ot, offset_mapping)
if sh == -1 or st == -1 or oh == -1 or ot == -1:
continue
p = label_maps["relation2id"][r["predicate"]]
rel_labels.append([sh, st, p, oh, ot])
outputs["labels"]["rel_labels"] = rel_labels
elif task_type == "opinion_extraction":
rel_labels = []
for r in example["aso_list"]:
_ah, _oh = r["aspect_start_index"], r["opinion_start_index"]
_at, _ot = _ah + len(r["aspect"]) - 1, _oh + len(r["opinion"]) - 1
ah = map_offset(_ah, offset_mapping)
at = map_offset(_at, offset_mapping)
oh = map_offset(_oh, offset_mapping)
ot = map_offset(_ot, offset_mapping)
if ah == -1 or at == -1 or oh == -1 or ot == -1:
continue
s = label_maps["sentiment2id"][r["sentiment"]]
rel_labels.append([ah, at, s, oh, ot])
outputs["labels"]["rel_labels"] = rel_labels
return outputs
def tokenize(example):
tokenized_inputs = tokenizer(
example["text"],
max_length=max_seq_len,
padding=False,
truncation=True,
return_attention_mask=True,
return_offsets_mapping=True,
return_token_type_ids=False,
)
tokenized_inputs["text"] = example["text"]
return tokenized_inputs
if mode == "train":
dataset = dataset.map(tokenize_and_align_train_labels)
else:
dataset_copy = copy.deepcopy(dataset)
dataset = dataset.map(tokenize)
data_collator = DataCollator(tokenizer, label_maps=label_maps, task_type=task_type)
shuffle = True if mode == "train" else False
batch_sampler = paddle.io.BatchSampler(dataset=dataset, batch_size=batch_size, shuffle=shuffle)
dataloader = paddle.io.DataLoader(
dataset=dataset, batch_sampler=batch_sampler, collate_fn=data_collator, num_workers=0, return_list=True
)
if mode != "train":
dataloader.dataset.raw_data = dataset_copy
return dataloader
def postprocess(batch_outputs, offset_mappings, texts, label_maps, task_type="relation_extraction"):
if task_type == "entity_extraction":
batch_ent_results = []
for entity_output, offset_mapping, text in zip(batch_outputs[0].numpy(), offset_mappings, texts):
entity_output[:, [0, -1]] -= np.inf
entity_output[:, :, [0, -1]] -= np.inf
ent_list = []
for l, start, end in zip(*np.where(entity_output > 0.0)):
start, end = (offset_mapping[start][0], offset_mapping[end][-1])
ent = {"text": text[start:end], "type": label_maps["id2entity"][l], "start_index": start}
ent_list.append(ent)
batch_ent_results.append(ent_list)
return batch_ent_results
else:
batch_ent_results = []
batch_rel_results = []
for entity_output, head_output, tail_output, offset_mapping, text in zip(
batch_outputs[0].numpy(),
batch_outputs[1].numpy(),
batch_outputs[2].numpy(),
offset_mappings,
texts,
):
entity_output[:, [0, -1]] -= np.inf
entity_output[:, :, [0, -1]] -= np.inf
ents = set()
ent_list = []
for l, start, end in zip(*np.where(entity_output > 0.0)):
ents.add((start, end))
start, end = (offset_mapping[start][0], offset_mapping[end][-1])
ent = {"text": text[start:end], "type": label_maps["id2entity"][l], "start_index": start}
ent_list.append(ent)
batch_ent_results.append(ent_list)
rel_list = []
for sh, st in ents:
for oh, ot in ents:
p1s = np.where(head_output[:, sh, oh] > 0.0)[0]
p2s = np.where(tail_output[:, st, ot] > 0.0)[0]
ps = set(p1s) & set(p2s)
for p in ps:
if task_type in ["relation_extraction", "event_extraction"]:
rel = {
"subject": text[offset_mapping[sh][0] : offset_mapping[st][1]],
"predicate": label_maps["id2relation"][p],
"object": text[offset_mapping[oh][0] : offset_mapping[ot][1]],
"subject_start_index": offset_mapping[sh][0],
"object_start_index": offset_mapping[oh][0],
}
else:
rel = {
"aspect": text[offset_mapping[sh][0] : offset_mapping[st][1]],
"sentiment": label_maps["id2relation"][p],
"opinion": text[offset_mapping[oh][0] : offset_mapping[ot][1]],
"aspect_start_index": offset_mapping[sh][0],
"opinion_start_index": offset_mapping[oh][0],
}
rel_list.append(rel)
batch_rel_results.append(rel_list)
return (batch_ent_results, batch_rel_results)
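The decoding in `postprocess` boils down to masking the special-token rows/columns and scanning the score tensor with `np.where`. A minimal sketch; the 1-type, 5-token score tensor below is fabricated for illustration:

```python
import numpy as np

# Fabricated GlobalPointer-style entity scores: (num_types, seq_len, seq_len).
# A positive score at [l, start, end] proposes a span of type l covering
# tokens start..end (inclusive).
entity_output = np.full((1, 5, 5), -1.0)
entity_output[0, 1, 2] = 3.5  # one proposed span: tokens 1..2, type 0
entity_output[0, 0, 3] = 2.0  # a span starting at the special token; masking removes it

# Mask spans that start or end on the special tokens (positions 0 and -1),
# mirroring the `-= np.inf` trick in postprocess above.
entity_output[:, [0, -1]] -= np.inf
entity_output[:, :, [0, -1]] -= np.inf

spans = [tuple(int(i) for i in t) for t in zip(*np.where(entity_output > 0.0))]
```

Only the span that touches neither special token survives the masking.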
def build_tree(schema, name="root"):
"""
Build the schema tree.
"""
schema_tree = SchemaTree(name)
for s in schema:
if isinstance(s, str):
schema_tree.add_child(SchemaTree(s))
elif isinstance(s, dict):
for k, v in s.items():
if isinstance(v, str):
child = [v]
elif isinstance(v, list):
child = v
else:
                    raise TypeError(
                        "Invalid schema: the value for each key should be a list or a string, "
                        "but {} was received.".format(type(v))
                    )
schema_tree.add_child(build_tree(child, name=k))
else:
            raise TypeError("Invalid schema: each element should be a string or a dict, but {} was received.".format(type(s)))
return schema_tree
def schema2label_maps(task_type, schema=None):
if schema and isinstance(schema, dict):
schema = [schema]
label_maps = {}
if task_type == "entity_extraction":
entity2id = {}
for s in schema:
entity2id[s] = len(entity2id)
label_maps["entity2id"] = entity2id
elif task_type == "opinion_extraction":
schema = ["观点词", {"评价维度": ["观点词", "情感倾向[正向,负向]"]}]
logger.info("Opinion extraction does not support custom schema, the schema is default to %s." % schema)
label_maps["entity2id"] = {"评价维度": 0, "观点词": 1}
label_maps["sentiment2id"] = {"正向": 0, "负向": 1}
else:
entity2id = {}
relation2id = {}
schema_tree = build_tree(schema)
schema_list = schema_tree.children[:]
while len(schema_list) > 0:
node = schema_list.pop(0)
if node.name not in entity2id.keys() and len(node.children) != 0:
entity2id[node.name] = len(entity2id)
for child in node.children:
if child.name not in relation2id.keys():
relation2id[child.name] = len(relation2id)
schema_list.append(child)
entity2id["object"] = len(entity2id)
label_maps["entity2id"] = entity2id
label_maps["relation2id"] = relation2id
label_maps["schema"] = schema
return label_maps
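For a one-level relation schema such as `{"武器名称": ["产国", "类型", "研发单位"]}`, the traversal above assigns an entity label to each subject type plus a shared `"object"` type, and a relation label to each predicate. A dependency-free sketch of that bookkeeping (a simplified stand-in for illustration, not the `SchemaTree`-based implementation):

```python
def mini_label_maps(schema):
    # schema: {subject_type: [predicate, ...]} -- one-level relation schema only.
    entity2id, relation2id = {}, {}
    for subject_type, predicates in schema.items():
        entity2id.setdefault(subject_type, len(entity2id))
        for p in predicates:
            relation2id.setdefault(p, len(relation2id))
    entity2id["object"] = len(entity2id)  # all objects share one catch-all type
    return {"entity2id": entity2id, "relation2id": relation2id}

maps = mini_label_maps({"武器名称": ["产国", "类型", "研发单位"]})
```

The catch-all `"object"` entity type is why `label_studio2distill` and `doccano2distill` map unknown entity labels to `"object"` in the relation-extraction branches.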
def anno2distill(json_lines, task_type, label_maps=None, platform="label_studio"):
if platform == "label_studio":
return label_studio2distill(json_lines, task_type, label_maps)
else:
return doccano2distill(json_lines, task_type, label_maps)
def label_studio2distill(json_lines, task_type, label_maps=None):
"""Convert label-studio to distill format"""
if task_type == "opinion_extraction":
outputs = []
for json_line in json_lines:
id2ent = {}
text = json_line["data"]["text"]
output = {"text": text}
entity_list = []
aso_list = []
annos = json_line["annotations"][0]["result"]
for anno in annos:
if anno["type"] == "labels":
ent_text = text[anno["value"]["start"] : anno["value"]["end"]]
ent_type_gather = anno["value"]["labels"][0].split("##")
if len(ent_type_gather) == 2:
ent_type, ent_senti = ent_type_gather
else:
ent_type = ent_type_gather[0]
ent_senti = None
ent = {"text": ent_text, "type": ent_type, "start_index": anno["value"]["start"]}
id2ent[anno["id"]] = ent
id2ent[anno["id"]]["sentiment"] = ent_senti
entity_list.append(ent)
else:
_aspect = id2ent[anno["from_id"]]
if _aspect["sentiment"]:
_opinion = id2ent[anno["to_id"]]
rel = {
"aspect": _aspect["text"],
"sentiment": _aspect["sentiment"],
"opinion": _opinion["text"],
"aspect_start_index": _aspect["start_index"],
"opinion_start_index": _opinion["start_index"],
}
aso_list.append(rel)
output["entity_list"] = entity_list
output["aso_list"] = aso_list
outputs.append(output)
else:
outputs = []
for json_line in json_lines:
id2ent = {}
text = json_line["data"]["text"]
output = {"text": text}
entity_list = []
spo_list = []
annos = json_line["annotations"][0]["result"]
for anno in annos:
if anno["type"] == "labels":
ent_text = text[anno["value"]["start"] : anno["value"]["end"]]
ent_label = anno["value"]["labels"][0]
ent_type = "object" if ent_label not in label_maps["entity2id"].keys() else ent_label
ent = {"text": ent_text, "type": ent_type, "start_index": anno["value"]["start"]}
id2ent[anno["id"]] = ent
entity_list.append(ent)
else:
_subject = id2ent[anno["from_id"]]
_object = id2ent[anno["to_id"]]
rel = {
"subject": _subject["text"],
"predicate": anno["labels"][0],
"object": _object["text"],
"subject_start_index": _subject["start_index"],
"object_start_index": _object["start_index"],
}
spo_list.append(rel)
output["entity_list"] = entity_list
output["spo_list"] = spo_list
outputs.append(output)
return outputs
def doccano2distill(json_lines, task_type, label_maps=None):
"""Convert doccano to distill format"""
if task_type == "opinion_extraction":
outputs = []
for json_line in json_lines:
id2ent = {}
text = json_line["text"]
output = {"text": text}
entity_list = []
entities = json_line["entities"]
for entity in entities:
ent_text = text[entity["start_offset"] : entity["end_offset"]]
ent_type_gather = entity["label"].split("##")
if len(ent_type_gather) == 2:
ent_type, ent_senti = ent_type_gather
else:
ent_type = ent_type_gather[0]
ent_senti = None
ent = {"text": ent_text, "type": ent_type, "start_index": entity["start_offset"]}
id2ent[entity["id"]] = ent
id2ent[entity["id"]]["sentiment"] = ent_senti
entity_list.append(ent)
output["entity_list"] = entity_list
aso_list = []
relations = json_line["relations"]
for relation in relations:
_aspect = id2ent[relation["from_id"]]
if _aspect["sentiment"]:
_opinion = id2ent[relation["to_id"]]
rel = {
"aspect": _aspect["text"],
"sentiment": _aspect["sentiment"],
"opinion": _opinion["text"],
"aspect_start_index": _aspect["start_index"],
"opinion_start_index": _opinion["start_index"],
}
aso_list.append(rel)
output["aso_list"] = aso_list
outputs.append(output)
else:
outputs = []
for json_line in json_lines:
id2ent = {}
text = json_line["text"]
output = {"text": text}
entity_list = []
entities = json_line["entities"]
for entity in entities:
ent_text = text[entity["start_offset"] : entity["end_offset"]]
if entity["label"] not in label_maps["entity2id"].keys():
if task_type == "entity_extraction":
                        logger.warning(
                            "Found an undefined label type. The schema should contain all label types present in the annotation file exported from the annotation platform."
                        )
continue
else:
ent_type = "object"
else:
ent_type = entity["label"]
ent = {"text": ent_text, "type": ent_type, "start_index": entity["start_offset"]}
id2ent[entity["id"]] = ent
entity_list.append(ent)
output["entity_list"] = entity_list
spo_list = []
relations = json_line["relations"]
for relation in relations:
_subject = id2ent[relation["from_id"]]
_object = id2ent[relation["to_id"]]
rel = {
"subject": _subject["text"],
"predicate": relation["type"],
"object": _object["text"],
"subject_start_index": _subject["start_index"],
"object_start_index": _object["start_index"],
}
spo_list.append(rel)
output["spo_list"] = spo_list
outputs.append(output)
return outputs
def synthetic2distill(texts, infer_results, task_type, label_maps=None):
"""Convert synthetic data to distill format"""
if task_type == "opinion_extraction":
outputs = []
for i, line in enumerate(infer_results):
pred = line
output = {"text": texts[i]}
entity_list = []
aso_list = []
for key1 in pred.keys():
for s in pred[key1]:
ent = {"text": s["text"], "type": key1, "start_index": s["start"]}
entity_list.append(ent)
if (
"relations" in s.keys()
and "观点词" in s["relations"].keys()
and "情感倾向[正向,负向]" in s["relations"].keys()
):
for o in s["relations"]["观点词"]:
rel = {
"aspect": s["text"],
"sentiment": s["relations"]["情感倾向[正向,负向]"][0]["text"],
"opinion": o["text"],
"aspect_start_index": s["start"],
"opinion_start_index": o["start"],
}
aso_list.append(rel)
ent = {"text": o["text"], "type": "观点词", "start_index": o["start"]}
entity_list.append(ent)
output["entity_list"] = entity_list
output["aso_list"] = aso_list
outputs.append(output)
else:
outputs = []
for i, line in enumerate(infer_results):
pred = line
output = {"text": texts[i]}
entity_list = []
spo_list = []
for key1 in pred.keys():
for s in pred[key1]:
ent = {"text": s["text"], "type": key1, "start_index": s["start"]}
entity_list.append(ent)
if "relations" in s.keys():
for key2 in s["relations"].keys():
for o1 in s["relations"][key2]:
if "start" in o1.keys():
rel = {
"subject": s["text"],
"predicate": key2,
"object": o1["text"],
"subject_start_index": s["start"],
"object_start_index": o1["start"],
}
spo_list.append(rel)
if "relations" not in o1.keys():
ent = {"text": o1["text"], "type": "object", "start_index": o1["start"]}
entity_list.append(ent)
else:
ent = {"text": o1["text"], "type": key2, "start_index": o1["start"]}
entity_list.append(ent)
for key3 in o1["relations"].keys():
for o2 in o1["relations"][key3]:
ent = {
"text": o2["text"],
"type": "object",
"start_index": o2["start"],
}
entity_list.append(ent)
rel = {
"subject": o1["text"],
"predicate": key3,
"object": o2["text"],
"subject_start_index": o1["start"],
"object_start_index": o2["start"],
}
spo_list.append(rel)
output["entity_list"] = entity_list
output["spo_list"] = spo_list
outputs.append(output)
return outputs
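The heart of `synthetic2distill` is flattening Taskflow's nested `{type: [{"text", "start", "relations": ...}]}` predictions into `entity_list`/`spo_list` records. A simplified, self-contained sketch handling a single level of relations; the input below is a fabricated Taskflow-style prediction, not real model output:

```python
def mini_distill(text, pred):
    # Flatten one Taskflow-style prediction into entity_list / spo_list records.
    entity_list, spo_list = [], []
    for ent_type, spans in pred.items():
        for s in spans:
            entity_list.append({"text": s["text"], "type": ent_type, "start_index": s["start"]})
            for predicate, objs in s.get("relations", {}).items():
                for o in objs:
                    # Objects without further relations get the catch-all type.
                    entity_list.append({"text": o["text"], "type": "object", "start_index": o["start"]})
                    spo_list.append(
                        {
                            "subject": s["text"],
                            "predicate": predicate,
                            "object": o["text"],
                            "subject_start_index": s["start"],
                            "object_start_index": o["start"],
                        }
                    )
    return {"text": text, "entity_list": entity_list, "spo_list": spo_list}

# Fabricated prediction, for illustration only.
out = mini_distill(
    "威尔哥减速炸弹由瑞典FFV军械公司研制",
    {"武器名称": [{"text": "威尔哥减速炸弹", "start": 0,
                   "relations": {"研发单位": [{"text": "瑞典FFV军械公司", "start": 8}]}}]},
)
```

The full implementation additionally recurses one more relation level and handles the opinion-extraction layout.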
# Service deployment based on PaddleNLP SimpleServing
## Contents
- [Environment Preparation](#environment-preparation)
- [Starting the Server](#starting-the-server)
- [Starting the Client Request](#starting-the-client-request)
- [Custom Service Parameters](#custom-service-parameters)
## Environment Preparation
Use a PaddleNLP version that includes SimpleServing (or the latest develop version):
```shell
pip install "paddlenlp>=2.4.4"
```
## Starting the Server
```bash
paddlenlp server server:app --workers 1 --host 0.0.0.0 --port 8189
```
## Starting the Client Request
```bash
python client.py
```
## Custom Service Parameters
### Server custom parameters
#### Schema replacement
```python
# Default schema
schema = {"武器名称": ["产国", "类型", "研发单位"]}
```
#### Setting the model path
```python
# Default task_path
uie = Taskflow('information_extraction', task_path='../../checkpoint/model_best/', schema=schema)
```
#### Multi-card service prediction
PaddleNLP SimpleServing supports load-balanced prediction across multiple cards: register two Taskflow tasks when registering the service, as in the sample code below.
```python
uie1 = Taskflow('information_extraction', task_path='../../checkpoint/model_best/', schema=schema, device_id=0)
uie2 = Taskflow('information_extraction', task_path='../../checkpoint/model_best/', schema=schema, device_id=1)
service.register_taskflow('uie', [uie1, uie2])
```
### Client custom parameters
```python
# Change to the input texts you want
texts = ['威尔哥(Virgo)减速炸弹是由瑞典FFV军械公司专门为瑞典皇家空军的攻击机实施低空高速轰炸而研制,1956年开始研制,1963年进入服役,装备于A32“矛盾”、A35“龙”、和AJ134“雷”攻击机,主要用于攻击登陆艇、停放的飞机、高炮、野战火炮、轻型防护装甲车辆以及有生力量。']
```
# Service deployment based on PaddleNLP SimpleServing
- [Environment Preparation](#1)
- [Server](#2)
- [Client](#3)
- [Service Custom Parameters](#4)
<a name="1"></a>
## Environment Preparation
Use a PaddleNLP version that includes SimpleServing (or the latest develop version):
```shell
pip install "paddlenlp>=2.4.4"
```
<a name="2"></a>
## Server
```bash
paddlenlp server server:app --workers 1 --host 0.0.0.0 --port 8189
```
<a name="3"></a>
## Client
```bash
python client.py
```
<a name="4"></a>
## Service Custom Parameters
### Server Custom Parameters
#### Schema replacement
```python
# Default schema
schema = {"Weapon Name": ["Country of Production", "Type", "R&D Unit"]}
```
#### Setting the model path
```python
# Default task_path
uie = Taskflow('information_extraction', task_path='../../checkpoint/model_best/', schema=schema)
```
#### Multi-card service prediction
PaddleNLP SimpleServing supports load-balanced prediction across multiple cards: register two Taskflow tasks when registering the service, as in the sample code below.
```python
uie1 = Taskflow('information_extraction', task_path='../../checkpoint/model_best/', schema=schema, device_id=0)
uie2 = Taskflow('information_extraction', task_path='../../checkpoint/model_best/', schema=schema, device_id=1)
service.register_taskflow('uie', [uie1, uie2])
```
### Client Custom Parameters
```python
# Change to the input texts you want
texts = ['威尔哥(Virgo)减速炸弹是由瑞典FFV军械公司专门为瑞典皇家空军的攻击机实施低空高速轰炸而研制,1956年开始研制,1963年进入服役,装备于A32“矛盾”、A35“龙”、和AJ134“雷”攻击机,主要用于攻击登陆艇、停放的飞机、高炮、野战火炮、轻型防护装甲车辆以及有生力量。']
```
# Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
import json
import requests
url = "http://0.0.0.0:8189/taskflow/uie"
headers = {"Content-Type": "application/json"}
texts = [
    "威尔哥(Virgo)减速炸弹是由瑞典FFV军械公司专门为瑞典皇家空军的攻击机实施低空高速轰炸而研制,1956年开始研制,1963年进入服役,装备于A32“矛盾”、A35“龙”、和AJ134“雷”攻击机,主要用于攻击登陆艇、停放的飞机、高炮、野战火炮、轻型防护装甲车辆以及有生力量。"
]
data = {"data": {"text": texts}}
r = requests.post(url=url, headers=headers, data=json.dumps(data))
datas = json.loads(r.text)
print(datas)
# Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
from paddlenlp import SimpleServer, Taskflow
# The schema changed to your defined schema
schema = {"武器名称": ["产国", "类型", "研发单位"]}
# The task path changed to your best model path
uie = Taskflow("information_extraction", schema=schema, task_path="../../checkpoint/model_best/")
# If you want to define the finetuned uie service
app = SimpleServer()
app.register_taskflow("taskflow/uie", uie)
# Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
import argparse
from functools import partial
import paddle
from utils import convert_example, create_data_loader, reader
from paddlenlp.data import DataCollatorWithPadding
from paddlenlp.datasets import MapDataset, load_dataset
from paddlenlp.metrics import SpanEvaluator
from paddlenlp.transformers import UIE, UIEM, AutoTokenizer
from paddlenlp.utils.ie_utils import get_relation_type_dict, unify_prompt_name
from paddlenlp.utils.log import logger
@paddle.no_grad()
def evaluate(model, metric, data_loader, multilingual=False):
    """
    Given a dataset, evaluate the model and compute the metric.
    Args:
        model(obj:`paddle.nn.Layer`): A model to classify texts.
        metric(obj:`paddle.metric.Metric`): The evaluation metric.
        data_loader(obj:`paddle.io.DataLoader`): The dataset loader which generates batches.
        multilingual(bool): Whether the model is multilingual.
    """
model.eval()
metric.reset()
for batch in data_loader:
if multilingual:
start_prob, end_prob = model(batch["input_ids"], batch["position_ids"])
else:
start_prob, end_prob = model(
batch["input_ids"], batch["token_type_ids"], batch["position_ids"], batch["attention_mask"]
)
start_ids = paddle.cast(batch["start_positions"], "float32")
end_ids = paddle.cast(batch["end_positions"], "float32")
num_correct, num_infer, num_label = metric.compute(start_prob, end_prob, start_ids, end_ids)
metric.update(num_correct, num_infer, num_label)
precision, recall, f1 = metric.accumulate()
model.train()
return precision, recall, f1
def do_eval():
paddle.set_device(args.device)
if args.model_path in ["uie-m-base", "uie-m-large"]:
args.multilingual = True
tokenizer = AutoTokenizer.from_pretrained(args.model_path)
if args.multilingual:
model = UIEM.from_pretrained(args.model_path)
else:
model = UIE.from_pretrained(args.model_path)
test_ds = load_dataset(reader, data_path=args.test_path, max_seq_len=args.max_seq_len, lazy=False)
class_dict = {}
relation_data = []
if args.debug:
for data in test_ds:
class_name = unify_prompt_name(data["prompt"])
# Only positive examples are evaluated in debug mode
if len(data["result_list"]) != 0:
p = "的" if args.schema_lang == "ch" else " of "
if p not in data["prompt"]:
class_dict.setdefault(class_name, []).append(data)
else:
relation_data.append((data["prompt"], data))
relation_type_dict = get_relation_type_dict(relation_data, schema_lang=args.schema_lang)
else:
class_dict["all_classes"] = test_ds
trans_fn = partial(
convert_example, tokenizer=tokenizer, max_seq_len=args.max_seq_len, multilingual=args.multilingual
)
for key in class_dict.keys():
if args.debug:
test_ds = MapDataset(class_dict[key])
else:
test_ds = class_dict[key]
test_ds = test_ds.map(trans_fn)
data_collator = DataCollatorWithPadding(tokenizer)
test_data_loader = create_data_loader(test_ds, mode="test", batch_size=args.batch_size, trans_fn=data_collator)
metric = SpanEvaluator()
precision, recall, f1 = evaluate(model, metric, test_data_loader, args.multilingual)
logger.info("-----------------------------")
logger.info("Class Name: %s" % key)
logger.info("Evaluation Precision: %.5f | Recall: %.5f | F1: %.5f" % (precision, recall, f1))
if args.debug and len(relation_type_dict.keys()) != 0:
for key in relation_type_dict.keys():
test_ds = MapDataset(relation_type_dict[key])
test_ds = test_ds.map(trans_fn)
test_data_loader = create_data_loader(
test_ds, mode="test", batch_size=args.batch_size, trans_fn=data_collator
)
metric = SpanEvaluator()
precision, recall, f1 = evaluate(model, metric, test_data_loader, args.multilingual)
logger.info("-----------------------------")
if args.schema_lang == "ch":
logger.info("Class Name: X的%s" % key)
else:
logger.info("Class Name: %s of X" % key)
logger.info("Evaluation Precision: %.5f | Recall: %.5f | F1: %.5f" % (precision, recall, f1))
if __name__ == "__main__":
# yapf: disable
parser = argparse.ArgumentParser()
parser.add_argument("--model_path", type=str, default=None, help="The path of saved model that you want to load.")
parser.add_argument("--test_path", type=str, default=None, help="The path of test set.")
parser.add_argument("--batch_size", type=int, default=16, help="Batch size per GPU/CPU for training.")
parser.add_argument("--device", type=str, default="gpu", choices=["gpu", "cpu", "npu"], help="Device selected for evaluate.")
parser.add_argument("--max_seq_len", type=int, default=512, help="The maximum total input sequence length after tokenization.")
parser.add_argument("--debug", action='store_true', help="Precision, recall and F1 score are calculated for each class separately if this option is enabled.")
parser.add_argument("--multilingual", action='store_true', help="Whether is the multilingual model.")
parser.add_argument("--schema_lang", choices=["ch", "en"], default="ch", help="Select the language type for schema.")
args = parser.parse_args()
# yapf: enable
do_eval()
# Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
import json
import os
from dataclasses import dataclass, field
from functools import partial
from typing import List, Optional
import paddle
from utils import convert_example, reader
from paddlenlp.data import DataCollatorWithPadding
from paddlenlp.datasets import load_dataset
from paddlenlp.metrics import SpanEvaluator
from paddlenlp.trainer import (
CompressionArguments,
PdArgumentParser,
Trainer,
get_last_checkpoint,
)
from paddlenlp.transformers import UIE, UIEM, AutoTokenizer, export_model
from paddlenlp.utils.ie_utils import compute_metrics, uie_loss_func
from paddlenlp.utils.log import logger
@dataclass
class DataArguments:
"""
Arguments pertaining to what data we are going to input our model for training and eval.
Using `PdArgumentParser` we can turn this class into argparse arguments to be able to
specify them on the command line.
"""
train_path: str = field(
default=None, metadata={"help": "The path of the training set."}
)
dev_path: str = field(
default=None, metadata={"help": "The path of the development set."}
)
max_seq_length: Optional[int] = field(
default=512,
metadata={
"help": "The maximum total input sequence length after tokenization. Sequences longer "
"than this will be truncated, sequences shorter will be padded."
},
)
dynamic_max_length: Optional[List[int]] = field(
default=None,
metadata={"help": "dynamic max length from batch, it can be array of length, eg: 16 32 64 128"},
)
@dataclass
class ModelArguments:
"""
Arguments pertaining to which model/config/tokenizer we are going to fine-tune from.
"""
model_name_or_path: Optional[str] = field(
default="uie-base",
metadata={
"help": "Path to pretrained model, such as 'uie-base', 'uie-tiny', "
"'uie-medium', 'uie-mini', 'uie-micro', 'uie-nano', 'uie-base-en', "
"'uie-m-base', 'uie-m-large', or finetuned model path."
},
)
export_model_dir: Optional[str] = field(
default=None,
metadata={"help": "Path to directory to store the exported inference model."},
)
multilingual: bool = field(default=False, metadata={"help": "Whether the model is a multilingual model."})
def main():
parser = PdArgumentParser((ModelArguments, DataArguments, CompressionArguments))
model_args, data_args, training_args = parser.parse_args_into_dataclasses()
training_args.label_names = ["start_positions", "end_positions"]
if model_args.model_name_or_path in ["uie-m-base", "uie-m-large"]:
model_args.multilingual = True
elif os.path.exists(os.path.join(model_args.model_name_or_path, "model_config.json")):
with open(os.path.join(model_args.model_name_or_path, "model_config.json")) as f:
init_class = json.load(f)["init_class"]
if init_class == "UIEM":
model_args.multilingual = True
# Log model and data config
training_args.print_config(model_args, "Model")
training_args.print_config(data_args, "Data")
paddle.set_device(training_args.device)
# Log on each process the small summary:
logger.warning(
f"Process rank: {training_args.local_rank}, device: {training_args.device}, world_size: {training_args.world_size}, "
+ f"distributed training: {bool(training_args.local_rank != -1)}, 16-bits training: {training_args.fp16}"
)
# Detecting last checkpoint.
last_checkpoint = None
if os.path.isdir(training_args.output_dir) and training_args.do_train and not training_args.overwrite_output_dir:
last_checkpoint = get_last_checkpoint(training_args.output_dir)
if last_checkpoint is None and len(os.listdir(training_args.output_dir)) > 0:
raise ValueError(
f"Output directory ({training_args.output_dir}) already exists and is not empty. "
"Use --overwrite_output_dir to overcome."
)
elif last_checkpoint is not None and training_args.resume_from_checkpoint is None:
logger.info(
f"Checkpoint detected, resuming training at {last_checkpoint}. To avoid this behavior, change "
"the `--output_dir` or add `--overwrite_output_dir` to train from scratch."
)
tokenizer = AutoTokenizer.from_pretrained(model_args.model_name_or_path)
if model_args.multilingual:
model = UIEM.from_pretrained(model_args.model_name_or_path)
else:
model = UIE.from_pretrained(model_args.model_name_or_path)
train_ds = load_dataset(reader, data_path=data_args.train_path, max_seq_len=data_args.max_seq_length, lazy=False)
dev_ds = load_dataset(reader, data_path=data_args.dev_path, max_seq_len=data_args.max_seq_length, lazy=False)
trans_fn = partial(
convert_example,
tokenizer=tokenizer,
max_seq_len=data_args.max_seq_length,
multilingual=model_args.multilingual,
dynamic_max_length=data_args.dynamic_max_length,
)
train_ds = train_ds.map(trans_fn)
dev_ds = dev_ds.map(trans_fn)
if training_args.device == "npu":
data_collator = DataCollatorWithPadding(tokenizer, padding="longest")
else:
data_collator = DataCollatorWithPadding(tokenizer)
trainer = Trainer(
model=model,
criterion=uie_loss_func,
args=training_args,
data_collator=data_collator,
train_dataset=train_ds if training_args.do_train or training_args.do_compress else None,
eval_dataset=dev_ds if training_args.do_eval or training_args.do_compress else None,
tokenizer=tokenizer,
compute_metrics=compute_metrics,
)
trainer.optimizer = paddle.optimizer.AdamW(
learning_rate=training_args.learning_rate, parameters=model.parameters()
)
checkpoint = None
if training_args.resume_from_checkpoint is not None:
checkpoint = training_args.resume_from_checkpoint
elif last_checkpoint is not None:
checkpoint = last_checkpoint
# Training
if training_args.do_train:
train_result = trainer.train(resume_from_checkpoint=checkpoint)
metrics = train_result.metrics
trainer.save_model()
trainer.log_metrics("train", metrics)
trainer.save_metrics("train", metrics)
trainer.save_state()
# Evaluate the model
if training_args.do_eval:
eval_metrics = trainer.evaluate()
trainer.log_metrics("eval", eval_metrics)
# export inference model
if training_args.do_export:
# You can also load from certain checkpoint
# trainer.load_state_dict_from_checkpoint("/path/to/checkpoint/")
if training_args.device == "npu":
# NPU casts int64 to int32 for internal computation; feeding int32
# inputs avoids the redundant cast.
input_spec_dtype = "int32"
else:
input_spec_dtype = "int64"
if model_args.multilingual:
input_spec = [
paddle.static.InputSpec(shape=[None, None], dtype=input_spec_dtype, name="input_ids"),
paddle.static.InputSpec(shape=[None, None], dtype=input_spec_dtype, name="position_ids"),
]
else:
input_spec = [
paddle.static.InputSpec(shape=[None, None], dtype=input_spec_dtype, name="input_ids"),
paddle.static.InputSpec(shape=[None, None], dtype=input_spec_dtype, name="token_type_ids"),
paddle.static.InputSpec(shape=[None, None], dtype=input_spec_dtype, name="position_ids"),
paddle.static.InputSpec(shape=[None, None], dtype=input_spec_dtype, name="attention_mask"),
]
if model_args.export_model_dir is None:
model_args.export_model_dir = os.path.join(training_args.output_dir, "export")
export_model(model=trainer.model, input_spec=input_spec, path=model_args.export_model_dir)
if training_args.do_compress:
@paddle.no_grad()
def custom_evaluate(self, model, data_loader):
metric = SpanEvaluator()
model.eval()
metric.reset()
for batch in data_loader:
if model_args.multilingual:
logits = model(input_ids=batch["input_ids"], position_ids=batch["position_ids"])
else:
logits = model(
input_ids=batch["input_ids"],
token_type_ids=batch["token_type_ids"],
position_ids=batch["position_ids"],
attention_mask=batch["attention_mask"],
)
start_prob, end_prob = logits
start_ids, end_ids = batch["start_positions"], batch["end_positions"]
num_correct, num_infer, num_label = metric.compute(start_prob, end_prob, start_ids, end_ids)
metric.update(num_correct, num_infer, num_label)
precision, recall, f1 = metric.accumulate()
logger.info("f1: %s, precision: %s, recall: %s" % (f1, precision, f1))
model.train()
return f1
trainer.compress(custom_evaluate=custom_evaluate)
if __name__ == "__main__":
main()
import json
import random
from typing import List, Optional
import numpy as np
import paddle
from paddlenlp.utils.log import logger
def set_seed(seed):
paddle.seed(seed)
random.seed(seed)
np.random.seed(seed)
def create_data_loader(dataset, mode="train", batch_size=1, trans_fn=None):
"""
Create dataloader.
Args:
dataset(obj:`paddle.io.Dataset`): Dataset instance.
mode(obj:`str`, optional, defaults to obj:`train`): If mode is 'train', it will shuffle the dataset randomly.
batch_size(obj:`int`, optional, defaults to 1): The sample number of a mini-batch.
trans_fn(obj:`callable`, optional, defaults to `None`): function to convert a data sample to input ids, etc.
Returns:
dataloader(obj:`paddle.io.DataLoader`): The dataloader which generates batches.
"""
if trans_fn:
dataset = dataset.map(trans_fn)
shuffle = True if mode == "train" else False
if mode == "train":
sampler = paddle.io.DistributedBatchSampler(dataset=dataset, batch_size=batch_size, shuffle=shuffle)
else:
sampler = paddle.io.BatchSampler(dataset=dataset, batch_size=batch_size, shuffle=shuffle)
dataloader = paddle.io.DataLoader(dataset, batch_sampler=sampler, return_list=True)
return dataloader
def map_offset(ori_offset, offset_mapping):
"""
map ori offset to token offset
"""
for index, span in enumerate(offset_mapping):
if span[0] <= ori_offset < span[1]:
return index
return -1
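# A standalone sketch of the span search that map_offset performs. The spans
# below are illustrative offset_mapping entries for three tokens; a character
# offset maps to the index of the token span containing it, or -1 if none does.
example_spans = [(0, 5), (5, 6), (6, 11)]
char_offset = 7
token_index = next((i for i, span in enumerate(example_spans) if span[0] <= char_offset < span[1]), -1)
assert token_index == 2  # offset 7 falls inside the third span (6, 11)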
def reader(data_path, max_seq_len=512):
"""
Read a JSONL dataset file; split examples whose content exceeds max_seq_len into multiple chunks.
"""
with open(data_path, "r", encoding="utf-8") as f:
for line in f:
json_line = json.loads(line)
content = json_line["content"].strip()
prompt = json_line["prompt"]
# Model input looks like: [CLS] Prompt [SEP] Content [SEP]
# It includes three special tokens.
if max_seq_len <= len(prompt) + 3:
raise ValueError("The value of max_seq_len is too small, please set a larger value")
max_content_len = max_seq_len - len(prompt) - 3
if len(content) <= max_content_len:
yield json_line
else:
result_list = json_line["result_list"]
json_lines = []
accumulate = 0
while True:
cur_result_list = []
for result in result_list:
if result["end"] - result["start"] > max_content_len:
logger.warning(
"result['end'] - result ['start'] exceeds max_content_len, which will result in no valid instance being returned"
)
if (
result["start"] + 1 <= max_content_len < result["end"]
and result["end"] - result["start"] <= max_content_len
):
max_content_len = result["start"]
break
cur_content = content[:max_content_len]
res_content = content[max_content_len:]
while True:
if len(result_list) == 0:
break
elif result_list[0]["end"] <= max_content_len:
if result_list[0]["end"] > 0:
cur_result = result_list.pop(0)
cur_result_list.append(cur_result)
else:
cur_result_list = [result for result in result_list]
break
else:
break
json_line = {"content": cur_content, "result_list": cur_result_list, "prompt": prompt}
json_lines.append(json_line)
for result in result_list:
if result["end"] <= 0:
break
result["start"] -= max_content_len
result["end"] -= max_content_len
accumulate += max_content_len
max_content_len = max_seq_len - len(prompt) - 3
if len(res_content) == 0:
break
elif len(res_content) < max_content_len:
json_line = {"content": res_content, "result_list": result_list, "prompt": prompt}
json_lines.append(json_line)
break
else:
content = res_content
for json_line in json_lines:
yield json_line
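# The reader above expects one JSON object per line (JSONL), with start/end as
# character offsets into content where end is exclusive. The field values in
# this minimal round-trip check are illustrative:
import json
sample = {"content": "Alice went home", "result_list": [{"text": "Alice", "start": 0, "end": 5}], "prompt": "person"}
parsed = json.loads(json.dumps(sample, ensure_ascii=False))
span = parsed["content"][parsed["result_list"][0]["start"]:parsed["result_list"][0]["end"]]
assert span == "Alice"  # end offset is exclusive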
def get_dynamic_max_length(examples, default_max_length: int, dynamic_max_length: List[int]) -> int:
"""get max_length by examples which you can change it by examples in batch"""
cur_length = len(examples[0]["input_ids"])
max_length = default_max_length
for max_length_option in sorted(dynamic_max_length):
if cur_length <= max_length_option:
max_length = max_length_option
break
return max_length
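# Standalone sketch of the bucket selection above: choose the smallest bucket
# that fits the current input length, falling back to the default otherwise.
# The bucket sizes and lengths here are illustrative.
buckets = [16, 32, 64, 128]
cur_length = 40
chosen = next((b for b in sorted(buckets) if cur_length <= b), 512)
assert chosen == 64  # 40 tokens fit in the 64 bucket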
def convert_example(
example, tokenizer, max_seq_len, multilingual=False, dynamic_max_length: Optional[List[int]] = None
):
"""
example: {
title
prompt
content
result_list
}
"""
if dynamic_max_length is not None:
temp_encoded_inputs = tokenizer(
text=[example["prompt"]],
text_pair=[example["content"]],
truncation=True,
max_seq_len=max_seq_len,
return_attention_mask=True,
return_position_ids=True,
return_dict=False,
return_offsets_mapping=True,
)
max_length = get_dynamic_max_length(
examples=temp_encoded_inputs, default_max_length=max_seq_len, dynamic_max_length=dynamic_max_length
)
# always pad to max_length
encoded_inputs = tokenizer(
text=[example["prompt"]],
text_pair=[example["content"]],
truncation=True,
max_seq_len=max_length,
pad_to_max_seq_len=True,
return_attention_mask=True,
return_position_ids=True,
return_dict=False,
return_offsets_mapping=True,
)
start_ids = [0.0 for x in range(max_length)]
end_ids = [0.0 for x in range(max_length)]
else:
encoded_inputs = tokenizer(
text=[example["prompt"]],
text_pair=[example["content"]],
truncation=True,
max_seq_len=max_seq_len,
pad_to_max_seq_len=True,
return_attention_mask=True,
return_position_ids=True,
return_dict=False,
return_offsets_mapping=True,
)
start_ids = [0.0 for x in range(max_seq_len)]
end_ids = [0.0 for x in range(max_seq_len)]
encoded_inputs = encoded_inputs[0]
offset_mapping = [list(x) for x in encoded_inputs["offset_mapping"]]
bias = 0
for index in range(1, len(offset_mapping)):
mapping = offset_mapping[index]
if mapping[0] == 0 and mapping[1] == 0 and bias == 0:
bias = offset_mapping[index - 1][1] + 1 # Includes [SEP] token
if mapping[0] == 0 and mapping[1] == 0:
continue
offset_mapping[index][0] += bias
offset_mapping[index][1] += bias
for item in example["result_list"]:
start = map_offset(item["start"] + bias, offset_mapping)
end = map_offset(item["end"] - 1 + bias, offset_mapping)
start_ids[start] = 1.0
end_ids[end] = 1.0
if multilingual:
tokenized_output = {
"input_ids": encoded_inputs["input_ids"],
"position_ids": encoded_inputs["position_ids"],
"start_positions": start_ids,
"end_positions": end_ids,
}
else:
tokenized_output = {
"input_ids": encoded_inputs["input_ids"],
"token_type_ids": encoded_inputs["token_type_ids"],
"position_ids": encoded_inputs["position_ids"],
"attention_mask": encoded_inputs["attention_mask"],
"start_positions": start_ids,
"end_positions": end_ids,
}
return tokenized_output
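# Worked example of the offset bias adjustment in convert_example. Token
# offsets for the content segment restart at 0 after the prompt's [SEP], so
# each content span is shifted by the prompt length + 1. The mapping below is
# illustrative: [CLS], 2 prompt tokens, [SEP], 2 content tokens, [SEP].
demo_offsets = [[0, 0], [0, 2], [2, 4], [0, 0], [0, 3], [3, 5], [0, 0]]
demo_bias = 0
for i in range(1, len(demo_offsets)):
    if demo_offsets[i] == [0, 0] and demo_bias == 0:
        demo_bias = demo_offsets[i - 1][1] + 1  # first [SEP]: prompt length + 1
    if demo_offsets[i] == [0, 0]:
        continue
    demo_offsets[i][0] += demo_bias
    demo_offsets[i][1] += demo_bias
assert demo_offsets[4] == [5, 8]  # content span [0, 3] shifted by bias of 5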