"tests/rag/test_modeling_rag.py" did not exist on "06971ac4f955072566b635e4f3e22cadc74be1a1"
Commit 10f294ff authored by yuguo-Jack's avatar yuguo-Jack
Browse files

llama_paddle

parent 7c64e6ec
Pipeline #678 failed with stages
in 0 seconds
# Document Information Extraction
**Table of contents**
- [1. Introduction](#1)
- [2. Quick Start](#2)
- [2.1 Code Structure](#21)
- [2.2 Data Annotation](#22)
- [2.3 Finetuning](#23)
- [2.4 Evaluation](#24)
- [2.5 Inference](#25)
- [2.6 Experiments](#26)
<a name="1"></a>
## 1. Introduction
This Information Extraction (IE) guide introduces our open-source, industry-grade solution covering the most widely used application scenarios of Information Extraction. It features **multi-domain, multi-task, and cross-modal capabilities** and covers the full lifecycle of **data labeling, model training and model deployment**. We hope this guide helps you apply Information Extraction techniques in your own products or models.
Information Extraction (IE) is the process of extracting structured information from input data such as text, pictures or scanned documents. While IE brings immense value, applying IE techniques is never easy, with challenges such as domain adaptation, heterogeneous structures, and lack of labeled data. This PaddleNLP Information Extraction Guide builds on our work in [Universal Information Extraction](https://arxiv.org/abs/2203.12277) and provides an industrial-grade solution that not only supports **extracting entities, relations, events and opinions from plain text**, but also supports **cross-modal extraction from documents, tables and pictures.** Our method features a flexible prompt that lets you specify extraction targets in simple natural language. We also provide several domain-adapted models specialized for different industry sectors.
**Highlights:**
- **Comprehensive Coverage🎓:** Covers various mainstream tasks of information extraction for plain text and document scenarios, supports multiple languages
- **State-of-the-Art Performance🏃:** Strong performance of the UIE series of models on plain-text and multimodal datasets. We also provide pretrained models of various sizes to meet different needs
- **Easy to Use⚡:** Three lines of code to use our `Taskflow` for out-of-the-box Information Extraction capabilities; one command line for model training and deployment
- **Efficient Tuning✊:** Developers can easily get started with data labeling and model training without a background in Machine Learning.
<a name="2"></a>
## 2. Quick Start
For a quick start, you can directly use `paddlenlp.Taskflow` out of the box, leveraging its zero-shot capability. For production use cases, we recommend labeling a small amount of data to fine-tune the model and further improve performance.
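For example, the following minimal sketch (the schema values and image path are illustrative placeholders) runs zero-shot extraction with the default `uie-x-base` model:
```python
from pprint import pprint

from paddlenlp import Taskflow

# Zero-shot document extraction: no fine-tuning, just a natural-language schema.
schema = ["开票日期", "名称", "金额"]  # illustrative extraction targets
ie = Taskflow("information_extraction", model="uie-x-base", schema=schema)
pprint(ie({"doc": "./data/images/b199.jpg"}))  # placeholder image path
```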
<a name="21"></a>
### 2.1 Code Structure
```shell
.
├── utils.py # data processing tools
├── finetune.py # model fine-tuning, compression script
├── evaluate.py # model evaluation script
└── README.md
```
<a name="22"></a>
### 2.2 Data Annotation
We recommend using [Label Studio](https://labelstud.io/) for data labeling. We provide an end-to-end pipeline from labeling to training: export the labeled data from Label Studio, then use the [label_studio.py](../label_studio.py) script to convert it into the input format required by the model. For a detailed introduction to labeling methods, please refer to the [Label Studio Data Labeling Guide](../label_studio_doc_en.md).
Here we provide a pre-labeled VAT invoice example dataset, which you can download with the commands below. We will demonstrate how to use the data conversion script to generate training/validation/test set files for fine-tuning.
Download the VAT invoice dataset:
```shell
wget https://paddlenlp.bj.bcebos.com/datasets/tax.tar.gz
tar -zxvf tax.tar.gz
mv tax data
rm tax.tar.gz
```
Generate training/validation data files:
```shell
python ../label_studio.py \
--label_studio_file ./data/label_studio.json \
--save_dir ./data \
--splits 0.8 0.2 0 \
--task_type ext
```
To generate the training/validation set files with PP-Structure layout analysis enabled (this optimizes the ordering of OCR results):
```shell
python ../label_studio.py \
--label_studio_file ./data/label_studio.json \
--save_dir ./data \
--splits 0.8 0.2 0 \
--task_type ext \
--layout_analysis True
```
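To sanity-check the conversion, you can inspect a converted sample. As the `reader` in `utils.py` (included later in this commit) shows, each line of `train.txt` is a JSON object with `content`, `prompt` and `result_list` fields, plus `bbox` and `image` for document inputs. A minimal sketch:
```python
import json

# Peek at the first converted training sample.
with open("data/train.txt", "r", encoding="utf-8") as f:
    sample = json.loads(f.readline())
print(sorted(sample.keys()))  # content / prompt / result_list (+ bbox / image for documents)
print(sample["prompt"])
```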
For more labeling rules and parameter descriptions for different types of tasks (including entity extraction, relationship extraction, document classification, etc.), please refer to [Label Studio Data Labeling Guide](../label_studio_doc_en.md).
<a name="23"></a>
### 2.3 Finetuning
Use the following command to fine-tune the model using `uie-x-base` as the pre-trained model, and save the fine-tuned model to `./checkpoint/model_best`:
Single GPU:
```shell
python finetune.py \
--device gpu \
--logging_steps 5 \
--save_steps 25 \
--eval_steps 25 \
--seed 42 \
--model_name_or_path uie-x-base \
--output_dir ./checkpoint/model_best \
--train_path data/train.txt \
--dev_path data/dev.txt \
--max_seq_len 512 \
--per_device_train_batch_size 8 \
--per_device_eval_batch_size 8 \
--num_train_epochs 10 \
--learning_rate 1e-5 \
--do_train \
--do_eval \
--do_export \
--export_model_dir ./checkpoint/model_best \
--overwrite_output_dir \
--disable_tqdm True \
--metric_for_best_model eval_f1 \
--load_best_model_at_end True \
--save_total_limit 1
```
Multiple GPUs:
```shell
python -u -m paddle.distributed.launch --gpus "0,1" finetune.py \
--device gpu \
--logging_steps 5 \
--save_steps 25 \
--eval_steps 25 \
--seed 42 \
--model_name_or_path uie-x-base \
--output_dir ./checkpoint/model_best \
--train_path data/train.txt \
--dev_path data/dev.txt \
--max_seq_len 512 \
--per_device_train_batch_size 8 \
--per_device_eval_batch_size 8 \
--num_train_epochs 10 \
--learning_rate 1e-5 \
--do_train \
--do_eval \
--do_export \
--export_model_dir ./checkpoint/model_best \
--overwrite_output_dir \
--disable_tqdm True \
--metric_for_best_model eval_f1 \
--load_best_model_at_end True \
--save_total_limit 1
```
Since `--do_eval` is set in the example command, evaluation runs automatically after training finishes.
Parameters:
* `device`: Device for training; one of 'cpu', 'gpu', or 'npu'. Defaults to 'gpu'.
* `logging_steps`: Interval, in steps, between log prints during training; defaults to 10.
* `save_steps`: Interval, in steps, between model checkpoints during training; defaults to 100.
* `eval_steps`: Interval, in steps, between evaluations during training; defaults to 100.
* `seed`: Global random seed; defaults to 42.
* `model_name_or_path`: The pre-trained model used for few-shot training; defaults to "uie-x-base".
* `output_dir`: Required. Directory where the model is saved after training or compression; defaults to `None`.
* `train_path`: Training set path; defaults to `None`.
* `dev_path`: Development set path; defaults to `None`.
* `max_seq_len`: Maximum sequence length; inputs longer than this are automatically split into chunks. Defaults to 512.
* `per_device_train_batch_size`: Training batch size per GPU/NPU core or CPU; defaults to 8.
* `per_device_eval_batch_size`: Evaluation batch size per GPU/NPU core or CPU; defaults to 8.
* `num_train_epochs`: Number of training epochs; 100 is a reasonable choice when using early stopping. Defaults to 10.
* `learning_rate`: Peak learning rate for training; 1e-5 is recommended for UIE-X. Defaults to 3e-5.
* `label_names`: Names of the training data labels; set to 'start_positions' 'end_positions' for UIE-X. Defaults to None.
* `do_train`: Whether to run fine-tuning; not set by default.
* `do_eval`: Whether to run evaluation; not set by default.
* `do_export`: Whether to export a static graph model for inference; not set by default.
* `export_model_dir`: Directory to export the static graph model to; defaults to None.
* `overwrite_output_dir`: If `True`, overwrite the contents of the output directory. If `output_dir` points to a checkpoint directory, use it to continue training.
* `disable_tqdm`: Whether to disable the tqdm progress bar.
* `metric_for_best_model`: Metric used to select the best model; `eval_f1` is recommended for UIE-X. Defaults to None.
* `load_best_model_at_end`: Whether to load the best model at the end of training; usually used together with `metric_for_best_model`. Defaults to False.
* `save_total_limit`: If set, limits the total number of checkpoints kept, removing older checkpoints from `output_dir`. Defaults to None.
<a name="24"></a>
### 2.4 Evaluation
```shell
python evaluate.py \
--device "gpu" \
--model_path ./checkpoint/model_best \
--test_path ./data/dev.txt \
--output_dir ./checkpoint/model_best \
--label_names 'start_positions' 'end_positions' \
--max_seq_len 512 \
--per_device_eval_batch_size 16
```
We adopt the single-stage method for evaluation, which means tasks that require multiple stages (e.g. relation extraction, event extraction) are evaluated separately for each stage. By default, the validation/test set uses all labels at the same level to construct the negative examples.
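For example, relation extraction is evaluated in two stages: the entity stage uses a plain label prompt (e.g. `名称及规格`), while the relation stage uses a composed prompt (e.g. `X的金额`). The debug branch of `evaluate.py` (included later in this commit) distinguishes the two by whether the prompt contains `的` (or ` of ` for English schemas); a minimal sketch of that split:
```python
def eval_stage(prompt: str, schema_lang: str = "ch") -> str:
    # Relation-stage prompts look like "X的金额" (ch) or "amount of X" (en);
    # entity-stage prompts are plain label names such as "开票日期".
    sep = "的" if schema_lang == "ch" else " of "
    return "relation" if sep in prompt else "entity"
```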
The `debug` mode can be turned on to evaluate each positive category separately. This mode is only used for model debugging:
```shell
python evaluate.py \
--device "gpu" \
--model_path ./checkpoint/model_best \
--test_path ./data/dev.txt \
--output_dir ./checkpoint/model_best \
--label_names 'start_positions' 'end_positions' \
--max_seq_len 512 \
--per_device_eval_batch_size 16 \
--debug True
```
Output result:
```text
[2022-11-14 09:41:18,424] [ INFO] - ***** Running Evaluation *****
[2022-11-14 09:41:18,424] [ INFO] - Num examples = 160
[2022-11-14 09:41:18,424] [ INFO] - Pre device batch size = 4
[2022-11-14 09:41:18,424] [ INFO] - Total Batch size = 4
[2022-11-14 09:41:18,424] [ INFO] - Total prediction steps = 40
[2022-11-14 09:41:26,451] [ INFO] - -----Evaluate model-------
[2022-11-14 09:41:26,451] [ INFO] - Class Name: ALL CLASSES
[2022-11-14 09:41:26,451] [ INFO] - Evaluation Precision: 0.94521 | Recall: 0.88462 | F1: 0.91391
[2022-11-14 09:41:26,451] [ INFO] - -----------------------------
[2022-11-14 09:41:26,452] [ INFO] - ***** Running Evaluation *****
[2022-11-14 09:41:26,452] [ INFO] - Num examples = 8
[2022-11-14 09:41:26,452] [ INFO] - Pre device batch size = 4
[2022-11-14 09:41:26,452] [ INFO] - Total Batch size = 4
[2022-11-14 09:41:26,452] [ INFO] - Total prediction steps = 2
[2022-11-14 09:41:26,692] [ INFO] - Class Name: 开票日期
[2022-11-14 09:41:26,692] [ INFO] - Evaluation Precision: 1.00000 | Recall: 1.00000 | F1: 1.00000
[2022-11-14 09:41:26,692] [ INFO] - -----------------------------
[2022-11-14 09:41:26,693] [ INFO] - ***** Running Evaluation *****
[2022-11-14 09:41:26,693] [ INFO] - Num examples = 8
[2022-11-14 09:41:26,693] [ INFO] - Pre device batch size = 4
[2022-11-14 09:41:26,693] [ INFO] - Total Batch size = 4
[2022-11-14 09:41:26,693] [ INFO] - Total prediction steps = 2
[2022-11-14 09:41:26,952] [ INFO] - Class Name: 名称
[2022-11-14 09:41:26,952] [ INFO] - Evaluation Precision: 0.87500 | Recall: 0.87500 | F1: 0.87500
[2022-11-14 09:41:26,952] [ INFO] - -----------------------------
...
```
Parameters:
* `device`: Device for evaluation; one of 'cpu', 'gpu', or 'npu'. Defaults to 'gpu'.
* `model_path`: Path of the model directory to evaluate; it must contain the model weights `model_state.pdparams` and the configuration file `model_config.json`.
* `test_path`: Test set file for evaluation.
* `label_names`: Names of the training data labels; set to 'start_positions' 'end_positions' for UIE-X. Defaults to None.
* `max_seq_len`: Maximum sequence length; inputs longer than this are automatically split into chunks. Defaults to 512.
* `per_device_eval_batch_size`: Evaluation batch size per GPU/NPU core or CPU; defaults to 8.
* `debug`: Whether to enable debug mode, which evaluates each positive category separately. Intended for model debugging only; disabled by default.
* `schema_lang`: Language of the schema; choose `ch` or `en`. Defaults to `ch`; use `en` for English datasets.
<a name="25"></a>
### 2.5 Inference
As with the pretrained models, you can use `paddlenlp.Taskflow` to load your custom model by passing the path of the fine-tuned weights through `task_path`:
```python
from pprint import pprint
from paddlenlp import Taskflow
from paddlenlp.utils.doc_parser import DocParser
schema = ['开票日期', '名称', '纳税人识别号', '开户行及账号', '金额', '价税合计', 'No', '税率', '地址、电话', '税额']
my_ie = Taskflow("information_extraction", model="uie-x-base", schema=schema, task_path='./checkpoint/model_best', precision='fp16')
```
We specify the extraction targets through `schema` and visualize the extracted information for the document at `doc_path`:
```python
doc_path = "./data/images/b199.jpg"
results = my_ie({"doc": doc_path})
pprint(results)
# Result visualization
DocParser.write_image_with_results(
doc_path,
result=results[0],
save_path="./image_show.png")
```
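`results` is a list with one dict per input, mapping each schema key to its extracted spans. A minimal post-processing sketch, assuming the standard Taskflow output fields `text` and `probability` per span:
```python
# Flatten the extraction result into (field, text, score) rows.
for field, spans in results[0].items():
    for span in spans:
        print(field, span["text"], round(span["probability"], 4))
```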
<div align="center">
<img src=https://user-images.githubusercontent.com/40840292/206084942-44ba477c-9244-4ce2-bbb5-ba430c9b926e.png height=550 width=700 />
</div>
<a name="26"></a>
### 2.6 Experiments
| Setting | Precision | Recall | F1 Score |
| :---: | :--------: | :--------: | :--------: |
| 0-shot| 0.44898 | 0.56410 | 0.50000 |
| 5-shot| 0.9000 | 0.9231 | 0.9114 |
| 10-shot| 0.9125 | 0.93590 | 0.9241 |
| 20-shot| 0.9737 | 0.9487 | 0.9610 |
| 30-shot| 0.9744 | 0.9744 | 0.9744 |
| 30-shot+PP-Structure| 1.0 | 0.9625 | 0.9809 |
n-shot means the training set contains n labeled images for model fine-tuning. The experiments show that UIE-X can further improve results with a small amount of labeled data (few-shot) plus PP-Structure layout analysis.
# Service Deployment Based on PaddleNLP SimpleServing
## Table of contents
- [Environment Preparation](#environment-preparation)
- [Starting the Server](#starting-the-server)
- [Starting the Client](#starting-the-client)
- [Custom Service Parameters](#custom-service-parameters)
## Environment Preparation
Use a PaddleNLP version that includes the SimpleServing feature (or the latest develop version):
```shell
pip install "paddlenlp>=2.4.4"
```
## Starting the Server
```bash
paddlenlp server server:app --workers 1 --host 0.0.0.0 --port 8189
```
## Starting the Client
```bash
python client.py
```
## Custom Service Parameters
### Server custom parameters
#### Schema replacement
```python
# Default schema
schema = ['开票日期', '名称', '纳税人识别号', '开户行及账号', '金额', '价税合计', 'No', '税率', '地址、电话', '税额']
```
#### Setting the model path
```python
# Default task_path
uie = Taskflow('information_extraction', task_path='../../checkpoint/model_best/', schema=schema)
```
#### Multi-GPU service prediction
PaddleNLP SimpleServing supports load-balanced prediction across multiple GPUs: simply register two Taskflow instances when registering the service, as in the following example:
```python
uie1 = Taskflow('information_extraction', task_path='../../checkpoint/model_best/', schema=schema, device_id=0)
uie2 = Taskflow('information_extraction', task_path='../../checkpoint/model_best/', schema=schema, device_id=1)
service.register_taskflow('uie', [uie1, uie2])
```
### Client custom parameters
```python
# Change to the image paths you want to request
image_paths = ['../../data/images/b1.jpg']
```
# Service deployment based on PaddleNLP SimpleServing
## Table of contents
- [Environment Preparation](#1)
- [Server](#2)
- [Client](#3)
- [Service Custom Parameters](#4)
<a name="1"></a>
## Environment Preparation
Use a PaddleNLP version that includes the SimpleServing feature (or the latest develop version):
```shell
pip install "paddlenlp>=2.4.4"
```
<a name="2"></a>
## Server
```bash
paddlenlp server server:app --workers 1 --host 0.0.0.0 --port 8189
```
<a name="3"></a>
## Client
```bash
python client.py
```
<a name="4"></a>
## Service custom parameters
### Server Custom Parameters
#### schema replacement
```python
# Default schema (the values must match the labels used during fine-tuning)
schema = ['开票日期', '名称', '纳税人识别号', '开户行及账号', '金额', '价税合计', 'No', '税率', '地址、电话', '税额']
```
#### Set model path
```python
# Default task_path
uie = Taskflow('information_extraction', task_path='../../checkpoint/model_best/', schema=schema)
```
#### Multi-GPU Service Prediction
PaddleNLP SimpleServing supports load-balanced prediction across multiple GPUs: simply register two Taskflow instances when registering the service, as in the following example:
```python
uie1 = Taskflow('information_extraction', task_path='../../checkpoint/model_best/', schema=schema, device_id=0)
uie2 = Taskflow('information_extraction', task_path='../../checkpoint/model_best/', schema=schema, device_id=1)
service.register_taskflow('uie', [uie1, uie2])
```
### Client Custom Parameters
```python
# Change to the image paths you want to request
image_paths = ['../../data/images/b1.jpg']
```
# Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
import json
import requests
from paddlenlp.utils.doc_parser import DocParser
# Define the document parser
doc_parser = DocParser()
image_paths = ["../../data/images/b1.jpg"]
image_base64_docs = []
# Get the image base64 to post
for image_path in image_paths:
req_dict = {}
doc = doc_parser.parse({"doc": image_path}, do_ocr=False)
base64 = doc["image"]
req_dict["doc"] = base64
image_base64_docs.append(req_dict)
url = "http://0.0.0.0:8189/taskflow/uie"
headers = {"Content-Type": "application/json"}
data = {"data": {"text": image_base64_docs}}
# Post the requests
r = requests.post(url=url, headers=headers, data=json.dumps(data))
datas = json.loads(r.text)
print(datas)
# Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
from paddlenlp import SimpleServer, Taskflow
# The schema changed to your defined schema
schema = ["开票日期", "名称", "纳税人识别号", "开户行及账号", "金额", "价税合计", "No", "税率", "地址、电话", "税额"]
# The task path changed to your best model path
uie = Taskflow(
"information_extraction",
schema=schema,
task_path="../../checkpoint/model_best",
)
# Register the fine-tuned UIE Taskflow as a service
app = SimpleServer()
app.register_taskflow("taskflow/uie", uie)
# Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
from dataclasses import dataclass, field
from functools import partial
from typing import Optional
import paddle
from utils import convert_example, reader
from paddlenlp.datasets import MapDataset, load_dataset
from paddlenlp.trainer import PdArgumentParser, Trainer, TrainingArguments
from paddlenlp.transformers import UIEX, AutoTokenizer
from paddlenlp.utils.ie_utils import (
compute_metrics,
get_relation_type_dict,
uie_loss_func,
unify_prompt_name,
)
from paddlenlp.utils.log import logger
@dataclass
class DataArguments:
"""
Arguments pertaining to what data we are going to input our model for evaluation.
Using `PdArgumentParser` we can turn this class into argparse arguments to be able to
specify them on the command line.
"""
test_path: str = field(default=None, metadata={"help": "The path of test set."})
schema_lang: str = field(
default="ch", metadata={"help": "Select the language type for schema, such as 'ch', 'en'"}
)
max_seq_len: Optional[int] = field(
default=512,
metadata={
"help": "The maximum total input sequence length after tokenization. Sequences longer "
"than this will be truncated, sequences shorter will be padded."
},
)
debug: bool = field(
default=False,
metadata={"help": "Whether choose debug mode."},
)
@dataclass
class ModelArguments:
"""
Arguments pertaining to which model/config/tokenizer we are going to fine-tune from.
"""
model_path: Optional[str] = field(
default=None, metadata={"help": "The path of saved model that you want to load."}
)
def do_eval():
parser = PdArgumentParser((ModelArguments, DataArguments, TrainingArguments))
model_args, data_args, training_args = parser.parse_args_into_dataclasses()
# Log model and data config
training_args.print_config(model_args, "Model")
training_args.print_config(data_args, "Data")
paddle.set_device(training_args.device)
tokenizer = AutoTokenizer.from_pretrained(model_args.model_path)
model = UIEX.from_pretrained(model_args.model_path)
test_ds = load_dataset(reader, data_path=data_args.test_path, max_seq_len=data_args.max_seq_len, lazy=False)
trans_fn = partial(convert_example, tokenizer=tokenizer, max_seq_len=data_args.max_seq_len)
if data_args.debug:
class_dict = {}
relation_data = []
for data in test_ds:
class_name = unify_prompt_name(data["prompt"])
# Only positive examples are evaluated in debug mode
if len(data["result_list"]) != 0:
p = "的" if data_args.schema_lang == "ch" else " of "
if p not in data["prompt"]:
class_dict.setdefault(class_name, []).append(data)
else:
relation_data.append((data["prompt"], data))
relation_type_dict = get_relation_type_dict(relation_data, schema_lang=data_args.schema_lang)
test_ds = test_ds.map(trans_fn)
trainer = Trainer(
model=model,
criterion=uie_loss_func,
args=training_args,
eval_dataset=test_ds,
tokenizer=tokenizer,
compute_metrics=compute_metrics,
)
eval_metrics = trainer.evaluate()
logger.info("-----Evaluate model-------")
logger.info("Class Name: ALL CLASSES")
logger.info(
"Evaluation Precision: %.5f | Recall: %.5f | F1: %.5f"
% (eval_metrics["eval_precision"], eval_metrics["eval_recall"], eval_metrics["eval_f1"])
)
logger.info("-----------------------------")
if data_args.debug:
for key in class_dict.keys():
test_ds = MapDataset(class_dict[key])
test_ds = test_ds.map(trans_fn)
eval_metrics = trainer.evaluate(eval_dataset=test_ds)
logger.info("Class Name: %s" % key)
logger.info(
"Evaluation Precision: %.5f | Recall: %.5f | F1: %.5f"
% (eval_metrics["eval_precision"], eval_metrics["eval_recall"], eval_metrics["eval_f1"])
)
logger.info("-----------------------------")
for key in relation_type_dict.keys():
test_ds = MapDataset(relation_type_dict[key])
test_ds = test_ds.map(trans_fn)
eval_metrics = trainer.evaluate(eval_dataset=test_ds)
logger.info("-----------------------------")
if data_args.schema_lang == "ch":
logger.info("Class Name: X的%s" % key)
else:
logger.info("Class Name: %s of X" % key)
logger.info(
"Evaluation Precision: %.5f | Recall: %.5f | F1: %.5f"
% (eval_metrics["eval_precision"], eval_metrics["eval_recall"], eval_metrics["eval_f1"])
)
if __name__ == "__main__":
do_eval()
# Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
import os
from dataclasses import dataclass, field
from functools import partial
from typing import List, Optional
import paddle
from utils import convert_example, reader
from paddlenlp.datasets import load_dataset
from paddlenlp.trainer import (
CompressionArguments,
PdArgumentParser,
Trainer,
get_last_checkpoint,
)
from paddlenlp.transformers import UIEX, AutoTokenizer, export_model
from paddlenlp.utils.ie_utils import compute_metrics, uie_loss_func
from paddlenlp.utils.log import logger
@dataclass
class DataArguments:
"""
Arguments pertaining to what data we are going to input our model for training and eval.
Using `PdArgumentParser` we can turn this class into argparse arguments to be able to
specify them on the command line.
"""
    train_path: str = field(
        default=None, metadata={"help": "The path of the training set."}
    )
    dev_path: str = field(
        default=None, metadata={"help": "The path of the development set."}
    )
max_seq_len: Optional[int] = field(
default=512,
metadata={
"help": "The maximum total input sequence length after tokenization. Sequences longer "
"than this will be truncated, sequences shorter will be padded."
},
)
dynamic_max_length: Optional[List[int]] = field(
default=None,
metadata={"help": "dynamic max length from batch, it can be array of length, eg: 16 32 64 128"},
)
@dataclass
class ModelArguments:
"""
Arguments pertaining to which model/config/tokenizer we are going to fine-tune from.
"""
model_name_or_path: Optional[str] = field(default="uie-x-base", metadata={"help": "Path to pretrained model"})
export_model_dir: Optional[str] = field(
default=None,
metadata={"help": "Path to directory to store the exported inference model."},
)
def main():
parser = PdArgumentParser((ModelArguments, DataArguments, CompressionArguments))
model_args, data_args, training_args = parser.parse_args_into_dataclasses()
training_args.label_names = ["start_positions", "end_positions"]
# Log model and data config
training_args.print_config(model_args, "Model")
training_args.print_config(data_args, "Data")
paddle.set_device(training_args.device)
# Log on each process the small summary:
logger.warning(
f"Process rank: {training_args.local_rank}, device: {training_args.device}, world_size: {training_args.world_size}, "
+ f"distributed training: {bool(training_args.local_rank != -1)}, 16-bits training: {training_args.fp16}"
)
# Detecting last checkpoint.
last_checkpoint = None
if os.path.isdir(training_args.output_dir) and training_args.do_train and not training_args.overwrite_output_dir:
last_checkpoint = get_last_checkpoint(training_args.output_dir)
if last_checkpoint is None and len(os.listdir(training_args.output_dir)) > 0:
raise ValueError(
f"Output directory ({training_args.output_dir}) already exists and is not empty. "
"Use --overwrite_output_dir to overcome."
)
elif last_checkpoint is not None and training_args.resume_from_checkpoint is None:
logger.info(
f"Checkpoint detected, resuming training at {last_checkpoint}. To avoid this behavior, change "
"the `--output_dir` or add `--overwrite_output_dir` to train from scratch."
)
# Define model and tokenizer
model = UIEX.from_pretrained(model_args.model_name_or_path)
tokenizer = AutoTokenizer.from_pretrained(model_args.model_name_or_path)
# Load and preprocess dataset
train_ds = load_dataset(reader, data_path=data_args.train_path, max_seq_len=data_args.max_seq_len, lazy=False)
dev_ds = load_dataset(reader, data_path=data_args.dev_path, max_seq_len=data_args.max_seq_len, lazy=False)
trans_fn = partial(
convert_example,
tokenizer=tokenizer,
max_seq_len=data_args.max_seq_len,
dynamic_max_length=data_args.dynamic_max_length,
)
train_ds = train_ds.map(trans_fn)
dev_ds = dev_ds.map(trans_fn)
trainer = Trainer(
model=model,
criterion=uie_loss_func,
args=training_args,
train_dataset=train_ds if training_args.do_train else None,
eval_dataset=dev_ds if training_args.do_eval else None,
tokenizer=tokenizer,
compute_metrics=compute_metrics,
)
trainer.optimizer = paddle.optimizer.AdamW(
learning_rate=training_args.learning_rate, parameters=model.parameters()
)
checkpoint = None
if training_args.resume_from_checkpoint is not None:
checkpoint = training_args.resume_from_checkpoint
elif last_checkpoint is not None:
checkpoint = last_checkpoint
# Training
if training_args.do_train:
train_result = trainer.train(resume_from_checkpoint=checkpoint)
metrics = train_result.metrics
trainer.save_model()
trainer.log_metrics("train", metrics)
trainer.save_metrics("train", metrics)
trainer.save_state()
    # Evaluate the model
if training_args.do_eval:
eval_metrics = trainer.evaluate()
trainer.log_metrics("eval", eval_metrics)
# export inference model
if training_args.do_export:
# You can also load from certain checkpoint
# trainer.load_state_dict_from_checkpoint("/path/to/checkpoint/")
input_spec = [
paddle.static.InputSpec(shape=[None, None], dtype="int64", name="input_ids"),
paddle.static.InputSpec(shape=[None, None], dtype="int64", name="token_type_ids"),
paddle.static.InputSpec(shape=[None, None], dtype="int64", name="position_ids"),
paddle.static.InputSpec(shape=[None, None], dtype="int64", name="attention_mask"),
paddle.static.InputSpec(shape=[None, None, 4], dtype="int64", name="bbox"),
paddle.static.InputSpec(shape=[None, 3, 224, 224], dtype="int64", name="image"),
]
if model_args.export_model_dir is None:
model_args.export_model_dir = os.path.join(training_args.output_dir, "export")
export_model(model=trainer.model, input_spec=input_spec, path=model_args.export_model_dir)
if __name__ == "__main__":
main()
# Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
import base64
import json
from typing import List, Optional
import numpy as np
from paddlenlp.utils.ie_utils import map_offset, pad_image_data
from paddlenlp.utils.log import logger
def reader(data_path, max_seq_len=512):
"""
read json
"""
with open(data_path, "r", encoding="utf-8") as f:
for line in f:
json_line = json.loads(line)
content = json_line["content"].strip()
prompt = json_line["prompt"]
boxes = json_line.get("bbox", None)
image = json_line.get("image", None)
            # Model input looks like: [CLS] prompt [SEP] [SEP] text [SEP] for UIE-X
if boxes is not None and image is not None:
summary_token_num = 4
else:
summary_token_num = 3
if max_seq_len <= len(prompt) + summary_token_num:
raise ValueError("The value of max_seq_len is too small, please set a larger value")
max_content_len = max_seq_len - len(prompt) - summary_token_num
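            # Contents longer than max_content_len are split into chunks in the
            # else branch below. Gold spans are re-based onto each chunk, and a
            # span that would straddle a chunk boundary pushes the split point
            # back to the start of that span.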
if len(content) <= max_content_len:
yield json_line
else:
result_list = json_line["result_list"]
json_lines = []
accumulate = 0
while True:
cur_result_list = []
for result in result_list:
if result["end"] - result["start"] > max_content_len:
logger.warning(
"result['end'] - result ['start'] exceeds max_content_len, which will result in no valid instance being returned"
)
if (
result["start"] + 1 <= max_content_len < result["end"]
and result["end"] - result["start"] <= max_content_len
):
max_content_len = result["start"]
break
cur_content = content[:max_content_len]
res_content = content[max_content_len:]
if boxes is not None and image is not None:
cur_boxes = boxes[:max_content_len]
res_boxes = boxes[max_content_len:]
while True:
if len(result_list) == 0:
break
elif result_list[0]["end"] <= max_content_len:
if result_list[0]["end"] > 0:
cur_result = result_list.pop(0)
cur_result_list.append(cur_result)
else:
cur_result_list = [result for result in result_list]
break
else:
break
if boxes is not None and image is not None:
json_line = {
"content": cur_content,
"result_list": cur_result_list,
"prompt": prompt,
"bbox": cur_boxes,
"image": image,
}
else:
json_line = {
"content": cur_content,
"result_list": cur_result_list,
"prompt": prompt,
}
json_lines.append(json_line)
for result in result_list:
if result["end"] <= 0:
break
result["start"] -= max_content_len
result["end"] -= max_content_len
accumulate += max_content_len
max_content_len = max_seq_len - len(prompt) - summary_token_num
if len(res_content) == 0:
break
elif len(res_content) < max_content_len:
if boxes is not None and image is not None:
json_line = {
"content": res_content,
"result_list": result_list,
"prompt": prompt,
"bbox": res_boxes,
"image": image,
}
else:
json_line = {"content": res_content, "result_list": result_list, "prompt": prompt}
json_lines.append(json_line)
break
else:
content = res_content
boxes = res_boxes
for json_line in json_lines:
yield json_line
def get_dynamic_max_len(examples, default_max_len: int, dynamic_max_length: List[int]) -> int:
"""get max_length by examples which you can change it by examples in batch"""
cur_length = len(examples[0]["input_ids"])
max_length = default_max_len
for max_length_option in sorted(dynamic_max_length):
if cur_length <= max_length_option:
max_length = max_length_option
break
return max_length
def convert_example(
example,
tokenizer,
max_seq_len,
pad_id=1,
c_sep_id=2,
summary_token_num=4,
dynamic_max_length: Optional[List[int]] = None,
):
content = example["content"]
prompt = example["prompt"]
bbox_lines = example.get("bbox", None)
image_buff_string = example.get("image", None)
# Text
if bbox_lines is None or image_buff_string is None:
if dynamic_max_length is not None:
temp_encoded_inputs = tokenizer(
text=[example["prompt"]],
text_pair=[example["content"]],
truncation=True,
max_seq_len=max_seq_len,
return_attention_mask=True,
return_position_ids=True,
return_dict=False,
return_offsets_mapping=True,
)
max_length = get_dynamic_max_len(
examples=temp_encoded_inputs, default_max_len=max_seq_len, dynamic_max_length=dynamic_max_length
)
# always pad to max_length
encoded_inputs = tokenizer(
text=[example["prompt"]],
text_pair=[example["content"]],
truncation=True,
max_seq_len=max_length,
pad_to_max_seq_len=True,
return_attention_mask=True,
return_position_ids=True,
return_dict=False,
return_offsets_mapping=True,
)
max_seq_len = max_length
else:
encoded_inputs = tokenizer(
text=[example["prompt"]],
text_pair=[example["content"]],
truncation=True,
max_seq_len=max_seq_len,
pad_to_max_seq_len=True,
return_attention_mask=True,
return_position_ids=True,
return_offsets_mapping=True,
return_dict=False,
)
encoded_inputs = encoded_inputs[0]
inputs_ids = encoded_inputs["input_ids"]
position_ids = encoded_inputs["position_ids"]
attention_mask = encoded_inputs["attention_mask"]
q_sep_index = inputs_ids.index(2, 1)
c_sep_index = attention_mask.index(0)
offset_mapping = [list(x) for x in encoded_inputs["offset_mapping"]]
bias = 0
for index in range(len(offset_mapping)):
if index == 0:
continue
mapping = offset_mapping[index]
if mapping[0] == 0 and mapping[1] == 0 and bias == 0:
# bias = index
bias = offset_mapping[index - 1][-1] + 1
if mapping[0] == 0 and mapping[1] == 0:
continue
offset_mapping[index][0] += bias
offset_mapping[index][1] += bias
offset_bias = bias
bbox_list = [[0, 0, 0, 0] for x in range(len(inputs_ids))]
token_type_ids = [
1 if token_index <= q_sep_index or token_index > c_sep_index else 0 for token_index in range(max_seq_len)
]
padded_image = np.zeros([3, 224, 224])
# Doc
else:
inputs_ids = []
prev_bbox = [-1, -1, -1, -1]
this_text_line = ""
q_sep_index = -1
offset_mapping = []
last_offset = 0
for char_index, (char, bbox) in enumerate(zip(content, bbox_lines)):
if char_index == 0:
prev_bbox = bbox
this_text_line = char
continue
if all([bbox[x] == prev_bbox[x] for x in range(4)]):
this_text_line += char
else:
offset_mapping, last_offset, q_sep_index, inputs_ids = _encode_doc(
tokenizer,
offset_mapping,
last_offset,
prompt,
this_text_line,
inputs_ids,
q_sep_index,
max_seq_len,
)
this_text_line = char
prev_bbox = bbox
if len(this_text_line) > 0:
offset_mapping, last_offset, q_sep_index, inputs_ids = _encode_doc(
tokenizer, offset_mapping, last_offset, prompt, this_text_line, inputs_ids, q_sep_index, max_seq_len
)
if len(inputs_ids) > max_seq_len:
inputs_ids = inputs_ids[: (max_seq_len - 1)] + [c_sep_id]
offset_mapping = offset_mapping[: (max_seq_len - 1)] + [[0, 0]]
else:
inputs_ids += [c_sep_id]
offset_mapping += [[0, 0]]
offset_bias = offset_mapping[q_sep_index - 1][-1] + 1
seq_len = len(inputs_ids)
inputs_ids += [pad_id] * (max_seq_len - seq_len)
token_type_ids = [1] * (q_sep_index + 1) + [0] * (seq_len - q_sep_index - 1)
token_type_ids += [pad_id] * (max_seq_len - seq_len)
bbox_list = _process_bbox(inputs_ids, bbox_lines, offset_mapping, offset_bias)
offset_mapping += [[0, 0]] * (max_seq_len - seq_len)
position_ids = list(range(seq_len))
position_ids = position_ids + [0] * (max_seq_len - seq_len)
attention_mask = [1] * seq_len + [0] * (max_seq_len - seq_len)
image_data = base64.b64decode(image_buff_string.encode("utf8"))
padded_image = pad_image_data(image_data)
start_ids = np.array([0.0 for x in range(max_seq_len)], dtype="int64")
end_ids = np.array([0.0 for x in range(max_seq_len)], dtype="int64")
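    # start_positions / end_positions are one-hot vectors over token positions:
    # for each gold span, the tokens of its start and end character offsets
    # (located via map_offset) are set to 1.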
for item in example["result_list"]:
start = map_offset(item["start"] + offset_bias, offset_mapping)
end = map_offset(item["end"] - 1 + offset_bias, offset_mapping)
start_ids[start] = 1.0
end_ids[end] = 1.0
assert len(inputs_ids) == max_seq_len
assert len(token_type_ids) == max_seq_len
assert len(position_ids) == max_seq_len
assert len(attention_mask) == max_seq_len
assert len(bbox_list) == max_seq_len
tokenized_output = {
"input_ids": inputs_ids,
"token_type_ids": token_type_ids,
"position_ids": position_ids,
"attention_mask": attention_mask,
"bbox": bbox_list,
"image": padded_image,
"start_positions": start_ids,
"end_positions": end_ids,
}
return tokenized_output
def _process_bbox(tokens, bbox_lines, offset_mapping, offset_bias):
bbox_list = [[0, 0, 0, 0] for x in range(len(tokens))]
for index, bbox in enumerate(bbox_lines):
index_token = map_offset(index + offset_bias, offset_mapping)
if 0 <= index_token < len(bbox_list):
bbox_list[index_token] = bbox
return bbox_list
def _encode_doc(tokenizer, offset_mapping, last_offset, prompt, this_text_line, inputs_ids, q_sep_index, max_seq_len):
if len(offset_mapping) == 0:
content_encoded_inputs = tokenizer(
text=[prompt],
text_pair=[this_text_line],
max_seq_len=max_seq_len,
return_dict=False,
return_offsets_mapping=True,
)
content_encoded_inputs = content_encoded_inputs[0]
inputs_ids = content_encoded_inputs["input_ids"][:-1]
sub_offset_mapping = [list(x) for x in content_encoded_inputs["offset_mapping"]]
q_sep_index = content_encoded_inputs["input_ids"].index(2, 1)
bias = 0
for i in range(len(sub_offset_mapping)):
if i == 0:
continue
mapping = sub_offset_mapping[i]
if mapping[0] == 0 and mapping[1] == 0 and bias == 0:
bias = sub_offset_mapping[i - 1][-1] + 1
if mapping[0] == 0 and mapping[1] == 0:
continue
if mapping == sub_offset_mapping[i - 1]:
continue
sub_offset_mapping[i][0] += bias
sub_offset_mapping[i][1] += bias
offset_mapping = sub_offset_mapping[:-1]
last_offset = offset_mapping[-1][-1]
else:
content_encoded_inputs = tokenizer(
text=this_text_line, max_seq_len=max_seq_len, return_dict=False, return_offsets_mapping=True
)
inputs_ids += content_encoded_inputs["input_ids"][1:-1]
sub_offset_mapping = [list(x) for x in content_encoded_inputs["offset_mapping"]]
for i, sub_list in enumerate(sub_offset_mapping[1:-1]):
if i == 0:
org_offset = sub_list[1]
else:
if sub_list[0] != org_offset and sub_offset_mapping[1:-1][i - 1] != sub_list:
last_offset += 1
org_offset = sub_list[1]
offset_mapping += [[last_offset, sub_list[1] - sub_list[0] + last_offset]]
last_offset = offset_mapping[-1][-1]
return offset_mapping, last_offset, q_sep_index, inputs_ids
# coding=utf-8
# Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
import argparse
import json
import os
import random
import time
from decimal import Decimal
import numpy as np
import paddle
from paddlenlp.trainer.argparser import strtobool
from paddlenlp.utils.log import logger
from paddlenlp.utils.tools import DataConverter
def set_seed(seed):
paddle.seed(seed)
random.seed(seed)
np.random.seed(seed)
def do_convert():
set_seed(args.seed)
tic_time = time.time()
if not os.path.exists(args.label_studio_file):
raise ValueError("Please input the correct path of label studio file.")
if not os.path.exists(args.save_dir):
os.makedirs(args.save_dir)
    if len(args.splits) != 0 and len(args.splits) != 3:
        raise ValueError("Only an empty list or a list of 3 values is accepted for splits.")
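    # Decimal arithmetic is used below so that splits such as [0.8, 0.1, 0.1]
    # sum exactly to 1 (naive float addition can fail the equality check).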
def _check_sum(splits):
return Decimal(str(splits[0])) + Decimal(str(splits[1])) + Decimal(str(splits[2])) == Decimal("1")
if len(args.splits) == 3 and not _check_sum(args.splits):
raise ValueError("Please set correct splits, sum of elements in splits should be equal to 1.")
with open(args.label_studio_file, "r", encoding="utf-8") as f:
raw_examples = json.loads(f.read())
if args.is_shuffle:
indexes = np.random.permutation(len(raw_examples))
index_list = indexes.tolist()
raw_examples = [raw_examples[i] for i in indexes]
i1, i2, _ = args.splits
p1 = int(len(raw_examples) * i1)
p2 = int(len(raw_examples) * (i1 + i2))
train_ids = index_list[:p1]
dev_ids = index_list[p1:p2]
test_ids = index_list[p2:]
with open(os.path.join(args.save_dir, "sample_index.json"), "w") as fp:
maps = {"train_ids": train_ids, "dev_ids": dev_ids, "test_ids": test_ids}
fp.write(json.dumps(maps))
if raw_examples[0]["data"].get("image"):
anno_type = "image"
else:
anno_type = "text"
data_converter = DataConverter(
args.label_studio_file,
negative_ratio=args.negative_ratio,
prompt_prefix=args.prompt_prefix,
options=args.options,
separator=args.separator,
layout_analysis=args.layout_analysis,
schema_lang=args.schema_lang,
ocr_lang=args.ocr_lang,
anno_type=anno_type,
)
if args.task_type == "ext":
train_examples = data_converter.convert_ext_examples(raw_examples[:p1])
dev_examples = data_converter.convert_ext_examples(raw_examples[p1:p2], is_train=False)
test_examples = data_converter.convert_ext_examples(raw_examples[p2:], is_train=False)
else:
train_examples = data_converter.convert_cls_examples(raw_examples[:p1])
dev_examples = data_converter.convert_cls_examples(raw_examples[p1:p2])
test_examples = data_converter.convert_cls_examples(raw_examples[p2:])
def _save_examples(save_dir, file_name, examples):
count = 0
save_path = os.path.join(save_dir, file_name)
with open(save_path, "w", encoding="utf-8") as f:
for example in examples:
f.write(json.dumps(example, ensure_ascii=False) + "\n")
count += 1
logger.info("Save %d examples to %s." % (count, save_path))
_save_examples(args.save_dir, "train.txt", train_examples)
_save_examples(args.save_dir, "dev.txt", dev_examples)
_save_examples(args.save_dir, "test.txt", test_examples)
logger.info("Finished! It takes %.2f seconds" % (time.time() - tic_time))
if __name__ == "__main__":
# yapf: disable
parser = argparse.ArgumentParser()
parser.add_argument("--label_studio_file", default="./data/label_studio.json", type=str, help="The annotation file exported from label studio platform.")
parser.add_argument("--save_dir", default="./data", type=str, help="The path of data that you wanna save.")
    parser.add_argument("--negative_ratio", default=5, type=int, help="Used only for the extraction task, the ratio of positive and negative samples: number of negative samples = negative_ratio * number of positive samples")
parser.add_argument("--splits", default=[0.8, 0.1, 0.1], type=float, nargs="*", help="The ratio of samples in datasets. [0.6, 0.2, 0.2] means 60% samples used for training, 20% for evaluation and 20% for test.")
parser.add_argument("--task_type", choices=['ext', 'cls'], default="ext", type=str, help="Select task type, ext for the extraction task and cls for the classification task, defaults to ext.")
parser.add_argument("--options", default=["正向", "负向"], type=str, nargs="+", help="Used only for the classification task, the options for classification")
parser.add_argument("--prompt_prefix", default="情感倾向", type=str, help="Used only for the classification task, the prompt prefix for classification")
parser.add_argument("--is_shuffle", default="True", type=strtobool, help="Whether to shuffle the labeled dataset, defaults to True.")
    parser.add_argument("--layout_analysis", default=False, type=strtobool, help="Enable layout analysis to optimize the order of OCR result.")
parser.add_argument("--seed", type=int, default=1000, help="Random seed for initialization")
parser.add_argument("--separator", type=str, default='##', help="Used only for entity/aspect-level classification task, separator for entity label and classification label")
parser.add_argument("--schema_lang", choices=["ch", "en"], default="ch", help="Select the language type for schema.")
parser.add_argument("--ocr_lang", choices=["ch", "en"], default="ch", help="Select the language type for OCR.")
args = parser.parse_args()
# yapf: enable
do_convert()
# Label Studio User Guide for Document Extraction Tasks
**Table of contents**
- [1. Installation](#1)
- [2. Document Extraction Task Annotation](#2)
- [2.1 Project Creation](#21)
- [2.2 Data Upload](#22)
- [2.3 Label Construction](#23)
- [2.4 Task Annotation](#24)
- [2.5 Data Export](#25)
- [2.6 Data Conversion](#26)
- [2.7 More Configuration](#27)
<a name="1"></a>
## 1. Installation
**Environment configuration used in the following annotation examples:**
- Python 3.8+
- label-studio == 1.6.0
- paddleocr >= 2.6.0.1
Install label-studio with pip in the terminal:
```shell
pip install label-studio==1.6.0
```
After the installation completes, run the following command:
```shell
label-studio start
```
Open [http://localhost:8080/](http://127.0.0.1:8080/) in your browser, log in with your username and password, and start annotating with label-studio.
<a name="2"></a>
## 2. Document Extraction Task Annotation
<a name="21"></a>
#### 2.1 Project Creation
Click Create to start creating a new project, fill in the project name and description, and then select ``Object Detection with Bounding Boxes``.
- Fill in the project name and description
<div align="center">
<img src=https://user-images.githubusercontent.com/40840292/199445809-1206f887-2782-459e-9001-fbd790d59a5e.png height=300 width=1200 />
</div>
- For **Named Entity Recognition, Relation Extraction, Event Extraction, and Entity/Aspect-level Classification** tasks, select ``Object Detection with Bounding Boxes``
<div align="center">
<img src=https://user-images.githubusercontent.com/40840292/199660090-d84901dd-001d-4620-bffa-0101a4ecd6e5.png height=400 width=1200 />
</div>
- For the **Document Classification** task, select ``Image Classification``
<div align="center">
<img src=https://user-images.githubusercontent.com/40840292/199729973-53a994d8-da71-4ab9-84f5-83297e19a7a1.png height=400 width=1200 />
</div>
- Add labels (this step can be skipped; labels can also be added later under Settings/Labeling Interface)
<div align="center">
<img src=https://user-images.githubusercontent.com/40840292/199450930-4c0cd189-6085-465a-aca0-6ba6f52a0c0d.png height=600 width=1200 />
</div>
The figure shows the construction of Span entity labels. For the construction of other label types, please refer to [2.3 Label Construction](#23).
<a name="22"></a>
#### 2.2 Data Upload
First upload images from local files or an HTTP link, then import them into the project.
<div align="center">
<img src=https://user-images.githubusercontent.com/40840292/199452007-2d45f7ba-c631-46b4-b21f-729a2ed652e9.png height=270 width=1200 />
</div>
<a name="23"></a>
#### 2.3 Label Construction
- Span entity labels
<div align="center">
<img src=https://user-images.githubusercontent.com/40840292/199456432-ce601ab0-7d6c-458f-ac46-8839dbc4d013.png height=500 width=1200 />
</div>
- Relation labels
<div align="center">
<img src=https://user-images.githubusercontent.com/40840292/199877621-f60e00c7-81ae-42e1-b498-8ebc5b5bd0fd.png height=650 width=1200 />
</div>
Relation XML template:
```xml
<Relations>
<Relation value="单位"/>
<Relation value="数量"/>
<Relation value="金额"/>
</Relations>
```
- Classification labels
<div align="center">
<img src=https://user-images.githubusercontent.com/40840292/199891626-cc995783-18d2-41dc-88de-260b979edc56.png height=500 width=1200 />
</div>
<a name="24"></a>
#### 2.4 Task Annotation
- Entity extraction
- Annotation example:
<div align="center">
<img src=https://user-images.githubusercontent.com/40840292/199879427-82806ffc-dc60-4ec7-bda5-e16419ee9d15.png height=650 width=800 />
</div>
- The schema corresponding to this annotation example is:
```text
schema = ['开票日期', '名称', '纳税人识别号', '地址、电话', '开户行及账号', '金额', '税额', '价税合计', 'No', '税率']
```
- Relation extraction
- Step 1. Annotate the subject (Subject) and the object (Object)
<div align="center">
<img src=https://user-images.githubusercontent.com/40840292/218974459-4bf989fc-0e40-4dea-b309-346364cca1b5.png height=400 width=1000 />
</div>
- Step 2. Draw the relation line; the arrow points from the subject (Subject) to the object (Object)
<div align="center">
<img src=https://user-images.githubusercontent.com/40840292/218975474-0cf933bc-7c1e-4e7d-ada5-685ee5265f61.png height=450 width=1000 />
</div>
<div align="center">
<img src=https://user-images.githubusercontent.com/40840292/218975743-dc718068-6d58-4352-8eb2-8973549dd971.png height=400 width=1000 />
</div>
- Step 3. Add the corresponding relation label
<div align="center">
<img src=https://user-images.githubusercontent.com/40840292/218976095-ff5a84e8-302c-4789-98df-139a8cef8d5a.png height=360 width=1000 />
</div>
<div align="center">
<img src=https://user-images.githubusercontent.com/40840292/218976368-a4556441-46ca-4372-b68b-e00b45f59260.png height=360 width=1000 />
</div>
- Step 4. Finish labeling
<div align="center">
<img src=https://user-images.githubusercontent.com/40840292/218976853-4903f2ec-b669-4c63-8c21-5f7184fc03db.png height=450 width=1000 />
</div>
- The schema corresponding to this annotation example is:
```text
schema = {
'名称及规格': [
'金额',
'单位',
'数量'
]
}
```
- Document classification
- Annotation example
<div align="center">
<img src=https://user-images.githubusercontent.com/40840292/199879238-b8b41d4a-7e77-47cd-8def-2fc8ba89442f.png height=650 width=800 />
</div>
- The schema corresponding to this annotation example is:
```text
schema = '文档类别[发票,报关单]'
```
<a name="25"></a>
#### 2.5 Data Export
Select the IDs of the annotated images, choose ``JSON`` as the export file type, and export the data:
<div align="center">
<img src=https://user-images.githubusercontent.com/40840292/199890897-b33ede99-97d8-4d44-877a-2518a87f8b67.png height=200 width=1200 />
</div>
<a name="26"></a>
#### 2.6 Data Conversion
Rename the exported file to ``label_studio.json`` and put it in the ``./document/data`` directory; put the corresponding annotated images in the ``./document/data/images`` directory (image file names must match the names used when uploading to Label Studio). The [label_studio.py](./label_studio.py) script then converts the data into the UIE data format.
- Path example
```shell
./document/data/
├── images # image directory
│   ├── b0.jpg # original image (file name must match the one uploaded to Label Studio)
│   └── b1.jpg
└── label_studio.json # annotation file exported from Label Studio
```
- Extraction task
```shell
python label_studio.py \
--label_studio_file ./document/data/label_studio.json \
--save_dir ./document/data \
--splits 0.8 0.1 0.1 \
--task_type ext
```
- Document classification task
```shell
python label_studio.py \
--label_studio_file ./document/data/label_studio.json \
--save_dir ./document/data \
--splits 0.8 0.1 0.1 \
--task_type cls \
--prompt_prefix "文档类别" \
--options "发票" "报关单"
```
<a name="27"></a>
#### 2.7 More Configuration
- ``label_studio_file``: The annotation file exported from Label Studio.
- ``save_dir``: Directory where the training data is saved; defaults to the ``data`` directory.
- ``negative_ratio``: Maximum negative-example ratio. Only valid for extraction tasks; constructing a proper number of negative examples can improve model performance. The number of negatives depends on the actual number of labels: maximum negatives = negative_ratio * number of positives. This parameter only applies to the training set and defaults to 5. To keep evaluation metrics accurate, the validation and test sets are built with all negative examples by default.
- ``splits``: Proportions of the training, validation and test sets when splitting the data. Defaults to [0.8, 0.1, 0.1], i.e. the data is split into training, validation and test sets at a ratio of ``8:1:1``.
- ``task_type``: Task type; extraction and classification tasks are supported.
- ``options``: Category labels for the classification task; only valid for classification tasks. Defaults to ["正向", "负向"].
- ``prompt_prefix``: Prompt prefix for the classification task; only valid for classification tasks. Defaults to "情感倾向".
- ``is_shuffle``: Whether to shuffle the dataset; defaults to True.
- ``seed``: Random seed; defaults to 1000.
- ``separator``: Separator between the entity category/aspect and the classification label; only valid for entity/aspect-level classification tasks. Defaults to "##".
- ``schema_lang``: Language of the schema, which determines how the prompts of the training data are constructed; choose from `ch` and `en`. Defaults to `ch`.
- ``ocr_lang``: Language used for OCR; choose from `ch` and `en`. Defaults to `ch`.
- ``layout_analysis``: Whether to use PP-Structure to analyze the document layout; only valid for document annotation tasks. Defaults to False.
Notes:
- By default, the [label_studio.py](./label_studio.py) script splits the data into train/dev/test sets according to the given proportions.
- Each run of the [label_studio.py](./label_studio.py) script overwrites existing data files with the same names.
- For model training we recommend constructing some negative examples to improve performance; this is built into the conversion step. The proportion of automatically constructed negative samples is controlled by `negative_ratio`; number of negatives = negative_ratio * number of positives.
- Every record in the file exported from Label Studio is assumed to have been correctly annotated by hand.
## References
- **[Label Studio](https://labelstud.io/)**
# Label Studio User Guide - Document Information Extraction
**Table of contents**
- [1. Installation](#1)
- [2. Document Extraction Task Annotation](#2)
- [2.1 Project Creation](#21)
- [2.2 Data Upload](#22)
- [2.3 Label Construction](#23)
- [2.4 Task Annotation](#24)
- [2.5 Data Export](#25)
- [2.6 Data Conversion](#26)
- [2.7 More Configuration](#27)
<a name="1"></a>
## 1. Installation
**Environment configuration used in the following annotation examples:**
- Python 3.8+
- label-studio == 1.6.0
- paddleocr >= 2.6.0.1
Use pip to install label-studio in the terminal:
```shell
pip install label-studio==1.6.0
```
Once the installation is complete, run the following command:
```shell
label-studio start
```
Open [http://localhost:8080/](http://127.0.0.1:8080/) in your browser, log in with your username and password, and start annotating with label-studio.
<a name="2"></a>
## 2. Document Extraction Task Annotation
<a name="21"></a>
#### 2.1 Project Creation
Click Create to start creating a new project, fill in the project name, description, and select ``Object Detection with Bounding Boxes``.
- Fill in the project name and description
<div align="center">
<img src=https://user-images.githubusercontent.com/40840292/199445809-1206f887-2782-459e-9001-fbd790d59a5e.png height=300 width=1200 />
</div>
- For **Named Entity Recognition, Relation Extraction, Event Extraction, and Entity/Aspect-level Classification** tasks, select ``Object Detection with Bounding Boxes``
<div align="center">
<img src=https://user-images.githubusercontent.com/40840292/199660090-d84901dd-001d-4620-bffa-0101a4ecd6e5.png height=400 width=1200 />
</div>
- For the **Document Classification** task, select ``Image Classification``
<div align="center">
<img src=https://user-images.githubusercontent.com/40840292/199729973-53a994d8-da71-4ab9-84f5-83297e19a7a1.png height=400 width=1200 />
</div>
- Add labels (this step can be skipped; labels can also be added later under Settings/Labeling Interface)
<div align="center">
<img src=https://user-images.githubusercontent.com/40840292/199450930-4c0cd189-6085-465a-aca0-6ba6f52a0c0d.png height=600 width=1200 />
</div>
The figure shows the construction of Span entity labels. For the construction of other label types, please refer to [2.3 Label Construction](#23).
<a name="22"></a>
#### 2.2 Data Upload
First upload images from local files or an HTTP link, then import them into the project.
<div align="center">
<img src=https://user-images.githubusercontent.com/40840292/199452007-2d45f7ba-c631-46b4-b21f-729a2ed652e9.png height=270 width=1200 />
</div>
<a name="23"></a>
#### 2.3 Label Construction
- Entity Label
<div align="center">
<img src=https://user-images.githubusercontent.com/40840292/199456432-ce601ab0-7d6c-458f-ac46-8839dbc4d013.png height=500 width=1200 />
</div>
- Relation label
<div align="center">
<img src=https://user-images.githubusercontent.com/40840292/199877621-f60e00c7-81ae-42e1-b498-8ebc5b5bd0fd.png height=650 width=1200 />
</div>
Relation XML template:
```xml
<Relations>
<Relation value="单位"/>
<Relation value="数量"/>
<Relation value="金额"/>
</Relations>
```
- Classification label
<div align="center">
<img src=https://user-images.githubusercontent.com/40840292/199891626-cc995783-18d2-41dc-88de-260b979edc56.png height=500 width=1200 />
</div>
<a name="24"></a>
#### 2.4 Task Annotation
- Entity extraction
- Annotation example:
<div align="center">
<img src=https://user-images.githubusercontent.com/40840292/199879427-82806ffc-dc60-4ec7-bda5-e16419ee9d15.png height=650 width=800 />
</div>
- The schema corresponding to this annotation example is:
```text
schema = ['开票日期', '名称', '纳税人识别号', '地址、电话', '开户行及账号', '金额', '税额', '价税合计', 'No', '税率']
```
- Relation extraction
- Step 1. Label the subject and object
<div align="center">
<img src=https://user-images.githubusercontent.com/40840292/218974459-4bf989fc-0e40-4dea-b309-346364cca1b5.png height=400 width=1000 />
</div>
- Step 2. Draw the relation line; the arrow points from the subject (Subject) to the object (Object)
<div align="center">
<img src=https://user-images.githubusercontent.com/40840292/218975474-0cf933bc-7c1e-4e7d-ada5-685ee5265f61.png height=450 width=1000 />
</div>
<div align="center">
<img src=https://user-images.githubusercontent.com/40840292/218975743-dc718068-6d58-4352-8eb2-8973549dd971.png height=400 width=1000 />
</div>
- Step 3. Add corresponding relation label
<div align="center">
<img src=https://user-images.githubusercontent.com/40840292/218976095-ff5a84e8-302c-4789-98df-139a8cef8d5a.png height=360 width=1000 />
</div>
<div align="center">
<img src=https://user-images.githubusercontent.com/40840292/218976368-a4556441-46ca-4372-b68b-e00b45f59260.png height=360 width=1000 />
</div>
- Step 4. Finish labeling
<div align="center">
<img src=https://user-images.githubusercontent.com/40840292/218976853-4903f2ec-b669-4c63-8c21-5f7184fc03db.png height=450 width=1000 />
</div>
- The schema corresponding to this annotation example is:
```text
schema = {
'名称及规格': [
'金额',
'单位',
'数量'
]
}
```
- Document classification
- Annotation example
<div align="center">
<img src=https://user-images.githubusercontent.com/40840292/199879238-b8b41d4a-7e77-47cd-8def-2fc8ba89442f.png height=650 width=800 />
</div>
- The schema corresponding to this annotation example is:
```text
schema = '文档类别[发票,报关单]'
```
<a name="25"></a>
#### 2.5 Data Export
Select the IDs of the labeled images, set the export file type to ``JSON``, and export the data:
<div align="center">
<img src=https://user-images.githubusercontent.com/40840292/199890897-b33ede99-97d8-4d44-877a-2518a87f8b67.png height=200 width=1200 />
</div>
<a name="26"></a>
#### 2.6 Data Conversion
Rename the exported file to ``label_studio.json`` and place it in the ``./document/data`` directory, and place the corresponding images in the ``./document/data/images`` directory (each image file name must match the name uploaded to label studio). The [label_studio.py](./label_studio.py) script then converts the data into the UIE input format.
- Path example
```shell
./document/data/
├── images # image directory
│ ├── b0.jpg # Original picture (the file name must be the same as the one uploaded to label studio)
│ └── b1.jpg
└── label_studio.json # Annotation file exported from label studio
```
- Extraction task
```shell
python label_studio.py \
--label_studio_file ./document/data/label_studio.json \
--save_dir ./document/data \
--splits 0.8 0.1 0.1 \
--task_type ext
```
- Document classification task
```shell
python label_studio.py \
--label_studio_file ./document/data/label_studio.json \
--save_dir ./document/data \
--splits 0.8 0.1 0.1 \
--task_type cls \
--prompt_prefix "document category" \
--options "invoice" "customs declaration"
```
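For reference, the `prompt_prefix` and `options` flags above combine into the classification prompt the model sees, matching the schema `'文档类别[发票,报关单]'` shown in [2.4 Task Annotation](#24). Below is a minimal sketch of that assembly; the helper is ours for illustration only, the real string building lives inside [label_studio.py](./label_studio.py):
```python
from typing import List

def build_cls_prompt(prompt_prefix: str, options: List[str]) -> str:
    # Illustrative only: the classification prompt is the prefix followed
    # by the candidate labels in square brackets.
    return "%s[%s]" % (prompt_prefix, ",".join(options))

print(build_cls_prompt("document category", ["invoice", "customs declaration"]))
# document category[invoice,customs declaration]
```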
<a name="27"></a>
#### 2.7 More Configuration
- ``label_studio_file``: The data labeling file exported from label studio.
- ``save_dir``: The directory where the training data is stored. Defaults to the ``data`` directory.
- ``negative_ratio``: The maximum ratio of negative examples. Only valid for extraction tasks; constructing negative examples appropriately can improve model performance. The number of negatives depends on the actual number of labels: maximum number of negatives = negative_ratio * number of positives. This parameter only applies to the training set and defaults to 5. To keep evaluation metrics accurate, the validation and test sets are built with all negative examples by default.
- ``splits``: The proportions of the training, validation and test sets when splitting the dataset. Defaults to [0.8, 0.1, 0.1], i.e. the data is split into training, validation and test sets at a ratio of ``8:1:1``.
- ``task_type``: The task type; extraction (``ext``) and classification (``cls``) are supported.
- ``options``: The category labels of the classification task. Only valid for classification tasks. Defaults to ["positive", "negative"].
- ``prompt_prefix``: The prompt prefix of the classification task. Only valid for classification tasks. Defaults to "情感倾向" (sentiment polarity).
- ``is_shuffle``: Whether to shuffle the dataset randomly. Defaults to True.
- ``seed``: Random seed. Defaults to 1000.
- ``separator``: The separator between the entity category/evaluation aspect and the classification label. Only valid for entity/aspect-level classification tasks. Defaults to "##".
- ``schema_lang``: The language of the schema, which determines how the training-data prompts are constructed. Options are `ch` and `en`. Defaults to `ch`.
- ``ocr_lang``: The language used for OCR. Options are `ch` and `en`. Defaults to `ch`.
- ``layout_analysis``: Whether to use PP-Structure to analyze the document layout. Only valid for document-type labeling tasks. Defaults to False.
Note:
- By default, the [label_studio.py](./label_studio.py) script divides the data proportionally into train/dev/test datasets.
- Every time the [label_studio.py](./label_studio.py) script is executed, existing data files with the same names are overwritten.
- For model training we recommend constructing some negative examples to improve model performance; this functionality is built into the data conversion stage. The number of automatically constructed negatives can be controlled with `negative_ratio`; number of negatives = negative_ratio * number of positives.
- Files exported from label_studio are assumed to contain only correct, manually verified labels.
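As a rough sketch of what the conversion does with `splits`, `is_shuffle` and `seed` (simplified for illustration; the real logic lives in [label_studio.py](./label_studio.py)):
```python
import random

def split_dataset(examples, splits=(0.8, 0.1, 0.1), is_shuffle=True, seed=1000):
    # Simplified illustration of the proportional train/dev/test split.
    data = list(examples)
    if is_shuffle:
        random.Random(seed).shuffle(data)
    n_train = int(len(data) * splits[0])
    n_dev = int(len(data) * splits[1])
    return data[:n_train], data[n_train:n_train + n_dev], data[n_train + n_dev:]

train, dev, test = split_dataset(range(100))
print(len(train), len(dev), len(test))  # 80 10 10
```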
## References
- **[Label Studio](https://labelstud.io/)**
# Label Studio User Guide - Text Information Extraction
**Table of contents**
- [1. Installation](#1)
- [2. Text Extraction Task Annotation](#2)
- [2.1 Project Creation](#21)
- [2.2 Data Upload](#22)
- [2.3 Label Construction](#23)
- [2.4 Task Annotation](#24)
- [2.5 Data Export](#25)
- [2.6 Data Conversion](#26)
- [2.7 More Configuration](#27)
<a name="1"></a>
## 1. Installation
**Environment configuration used in the following annotation examples:**
- Python 3.8+
- label-studio == 1.6.0
- paddleocr >= 2.6.0.1
Use pip to install label-studio in the terminal:
```shell
pip install label-studio==1.6.0
```
Once the installation is complete, run the following command:
```shell
label-studio start
```
Open [http://localhost:8080/](http://localhost:8080/) in your browser, log in with your username and password, and start labeling with label-studio.
<a name="2"></a>
## 2. Text Extraction Task Annotation
<a name="21"></a>
#### 2.1 Project Creation
Click Create to create a new project, fill in the project name and description, and select a labeling template according to your task type:
- Fill in the project name, description
<div align="center">
<img src=https://user-images.githubusercontent.com/40840292/199661377-d9664165-61aa-4462-927d-225118b8535b.png height=230 width=1200 />
</div>
- For **Named Entity Recognition, Relation Extraction, Event Extraction and Opinion Extraction** tasks, select ``Relation Extraction``.
<div align="center">
<img src=https://user-images.githubusercontent.com/40840292/199661638-48a870eb-a1df-4db5-82b9-bc8e985f5190.png height=350 width=1200 />
</div>
- For **Text Classification and Sentence-level Sentiment Classification** tasks, select ``Text Classification``.
<div align="center">
<img src=https://user-images.githubusercontent.com/40840292/212617773-34534e68-4544-4b24-8f39-ae7f9573d397.png height=420 width=1200 />
</div>
- Define labels (this step can also be skipped and configured later under Settings/Labeling Interface)
<div align="center">
<img src=https://user-images.githubusercontent.com/40840292/199662737-ed996a2c-7a24-4077-8a36-239c4bfb0a16.png height=380 width=1200 />
</div>
The figure shows how entity labels are constructed. For other label types, refer to [2.3 Label Construction](#23).
<a name="22"></a>
#### 2.2 Data upload
First upload a txt-format file from your local machine, select ``List of tasks``, and then import it into this project.
<div align="center">
<img src=https://user-images.githubusercontent.com/40840292/199667670-1b8f6755-b41f-41c4-8afc-06bb051690b6.png height=210 width=1200 />
</div>
<a name="23"></a>
#### 2.3 Label Construction
- Entity label
<div align="center">
<img src=https://user-images.githubusercontent.com/40840292/199667941-04e300c5-3cd7-4b8e-aaf5-561415414891.png height=480 width=1200 />
</div>
- Relation label
<div align="center">
<img src=https://user-images.githubusercontent.com/40840292/199725229-f5e998bf-367c-4449-b83a-c799f1e3de00.png height=620 width=1200 />
</div>
Relation XML template:
```xml
<Relations>
<Relation value="Singer"/>
<Relation value="Published"/>
<Relation value="Album"/>
</Relations>
```
- Classification label
<div align="center">
<img src=https://user-images.githubusercontent.com/40840292/199724082-ee82dceb-dab0-496d-a930-a8ecb284d8b2.png height=370 width=1200 />
</div>
<a name="24"></a>
#### 2.4 Task Annotation
- Entity extraction
Annotation example:
<div align="center">
<img src=https://user-images.githubusercontent.com/40840292/199879957-aeec9d17-d342-4ea0-a840-457b49f6066e.png height=140 width=1000 />
</div>
The schema corresponding to this annotation example is:
```text
schema = [
'时间',
'选手',
'赛事名称',
'得分'
]
```
- Relation extraction
<div align="center">
<img src=https://user-images.githubusercontent.com/40840292/199879866-03c1ecac-1828-4f35-af70-9ae61701c303.png height=230 width=1200 />
</div>
For relation extraction, choosing the type of P (the predicate) is very important and should follow this principle:
"{P} of {S} is {O}" must form a semantically reasonable phrase. For example, for the triple (S, father-son, O), "father-son" is a valid relation category in itself, but under the way UIE currently constructs relation-type prompts, "the father-son of S is O" does not read well, so it is better to set P to "child", i.e. "the child of S is O". **A well-chosen P type significantly improves zero-shot performance** (a short sketch after the annotation examples below shows a quick way to check this).
The schema corresponding to this annotation example is:
```text
schema = {
'作品名': [
'歌手',
'发行时间',
'所属专辑'
]
}
```
- Event extraction
<div align="center">
<img src=https://user-images.githubusercontent.com/40840292/199879776-75abbade-9bea-44dc-ac36-322fecdc03e0.png height=220 width=1200 />
</div>
The schema corresponding to this annotation example is:
```text
schema = {
'地震触发词': [
'时间',
'震级'
]
}
```
- Sentence-level classification
<div align="center">
<img src=https://user-images.githubusercontent.com/40840292/199879672-c3f286fe-a217-4888-950f-d4ee45b19f5a.png height=210 width=1000 />
</div>
The schema corresponding to this annotation example is:
```text
schema = '情感倾向[正向,负向]'
```
- Opinion Extraction
<div align="center">
<img src=https://user-images.githubusercontent.com/40840292/199879586-8c6e4826-a3b0-49e0-9920-98ca062dccff.png height=240 width=1200 />
</div>
The schema corresponding to this annotation example is:
```text
schema = {
'评价维度': [
'观点词',
'情感倾向[正向,负向]'
]
}
```
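As promised above, a quick way to sanity-check a candidate P name is to render the prompt the way UIE phrases it and read it aloud. A minimal sketch, assuming the "{S}的{P}" / "{P} of {S}" templates described in the relation-extraction principle; the helper itself is ours, not a PaddleNLP API:
```python
def relation_prompt(subject: str, predicate: str, schema_lang: str = "ch") -> str:
    # Render the relation prompt so annotators can check that the
    # predicate name reads naturally in context.
    if schema_lang == "ch":
        return subject + "的" + predicate      # e.g. "作品名的歌手"
    return "%s of %s" % (predicate, subject)  # e.g. "Singer of Work"

print(relation_prompt("S", "父子"))  # "S的父子" reads poorly -> rename P
print(relation_prompt("S", "孩子"))  # "S的孩子" reads naturally
```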
<a name="25"></a>
#### 2.5 Data Export
Select the IDs of the labeled texts, set the export file type to ``JSON``, and export the data:
<div align="center">
<img src=https://user-images.githubusercontent.com/40840292/199891344-023736e2-6f9d-454b-b72a-dec6689f8436.png height=180 width=1200 />
</div>
<a name="26"></a>
#### 2.6 Data Conversion
Rename the exported file to ``label_studio.json`` and put it in the ``./data`` directory. The [label_studio.py](./label_studio.py) script then converts it into the UIE data format.
- Extraction task
```shell
python label_studio.py \
--label_studio_file ./data/label_studio.json \
--save_dir ./data \
--splits 0.8 0.1 0.1 \
--task_type ext
```
- Sentence-level classification tasks
During data conversion, prompt information for model training is constructed automatically. For example, in sentence-level sentiment classification the prompt is ``Sentiment Classification [positive, negative]``, which can be configured through the `prompt_prefix` and `options` parameters.
```shell
python label_studio.py \
--label_studio_file ./data/label_studio.json \
--task_type cls \
--save_dir ./data \
--splits 0.8 0.1 0.1 \
--prompt_prefix "Sentiment Classification" \
--options "positive" "negative"
```
- Entity/aspect-level classification tasks
During data conversion, prompt information for model training is constructed automatically. For example, in aspect-level sentiment classification the prompt is ``Sentiment Classification of xxx [positive, negative]``, which can be declared through the `prompt_prefix` and `options` parameters.
```shell
python label_studio.py \
--label_studio_file ./data/label_studio.json \
--task_type ext \
--save_dir ./data \
--splits 0.8 0.1 0.1 \
--prompt_prefix "Sentiment Classification" \
--options "positive" "negative" \
--separator "##"
```
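For aspect-level tasks, the `separator` ties the evaluation aspect to its classification label inside the converted data, and the prompt prefix is combined with the aspect at training time. A minimal sketch of the two string conventions described here; both helpers are ours for illustration and are not part of label_studio.py:
```python
from typing import List, Tuple

def build_aspect_prompt(aspect: str, prompt_prefix: str, options: List[str]) -> str:
    # e.g. -> 'Sentiment Classification of food [positive, negative]'
    return "%s of %s [%s]" % (prompt_prefix, aspect, ", ".join(options))

def parse_aspect_label(label: str, separator: str = "##") -> Tuple[str, str]:
    # A combined label such as "评价维度##正向" splits into the aspect
    # category and its classification label at `separator`.
    category, polarity = label.split(separator, 1)
    return category, polarity

print(build_aspect_prompt("food", "Sentiment Classification", ["positive", "negative"]))
print(parse_aspect_label("评价维度##正向"))
```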
<a name="27"></a>
#### 2.7 More Configuration
- ``label_studio_file``: The data labeling file exported from label studio.
- ``save_dir``: The directory where the training data is stored. Defaults to the ``data`` directory.
- ``negative_ratio``: The maximum ratio of negative examples. Only valid for extraction tasks; constructing negative examples appropriately can improve model performance. The number of negatives depends on the actual number of labels: maximum number of negatives = negative_ratio * number of positives. This parameter only applies to the training set and defaults to 5. To keep evaluation metrics accurate, the validation and test sets are built with all negative examples by default.
- ``splits``: The proportions of the training, validation and test sets when splitting the dataset. Defaults to [0.8, 0.1, 0.1], i.e. the data is split into training, validation and test sets at a ratio of ``8:1:1``.
- ``task_type``: The task type; extraction (``ext``) and classification (``cls``) are supported.
- ``options``: The category labels of the classification task. Only valid for classification tasks. Defaults to ["positive", "negative"].
- ``prompt_prefix``: The prompt prefix of the classification task. Only valid for classification tasks. Defaults to "情感倾向" (sentiment polarity).
- ``is_shuffle``: Whether to shuffle the dataset randomly. Defaults to True.
- ``seed``: Random seed. Defaults to 1000.
- ``schema_lang``: The language of the schema, which determines how the training-data prompts are constructed. Options are `ch` and `en`. Defaults to `ch`.
- ``separator``: The separator between the entity category/evaluation aspect and the classification label. Only valid for entity/aspect-level classification tasks. Defaults to "##".
Note:
- By default, the [label_studio.py](./label_studio.py) script divides the data proportionally into train/dev/test datasets.
- Every time the [label_studio.py](./label_studio.py) script is executed, existing data files with the same names are overwritten.
- For model training we recommend constructing some negative examples to improve model performance; this functionality is built into the data conversion stage. The number of automatically constructed negatives can be controlled with `negative_ratio`; number of negatives = negative_ratio * number of positives.
- Files exported from label_studio are assumed to contain only correct, manually verified labels.
## References
- **[Label Studio](https://labelstud.io/)**
# UIE Taskflow User Guide
**Table of contents**
- [1. Introduction](#1)
- [2. Document Information Extraction](#2)
- [2.1 Entity Extraction](#21)
- [2.2 Relation Extraction](#22)
- [2.3 Multi-Task Extraction](#23)
- [2.4 Input Format](#24)
- [2.5 Tips](#25)
- [2.6 Visualization](#26)
- [2.7 More Configuration](#27)
<a name="1"></a>
## 1. Introduction
```paddlenlp.Taskflow``` provides general-purpose information extraction for both text and documents, including opinion extraction. It can extract many types of information, including but not limited to named entities (person, place and organization names, etc.), relations (the director of a movie, the release time of a song, etc.), events (a car accident at an intersection, an earthquake somewhere, etc.), as well as evaluation aspects, opinion words and sentiment polarity. Users specify extraction targets in natural language and extract the corresponding information from input text or documents without any training, **providing an out-of-the-box solution for all kinds of information extraction needs**.
<a name="2"></a>
## 2. Document Information Extraction
This section introduces the document extraction capability of Taskflow. The example images used below are available at this [download link](https://bj.bcebos.com/paddlenlp/taskflow/information_extraction/cases.zip).
<a name="21"></a>
#### 2.1 Entity Extraction
Entity extraction, also known as Named Entity Recognition (NER), refers to identifying entities with specific meanings in text. UIE adopts an open-domain approach: entity categories are not fixed, and users can define them through natural language.
- Example: Customs Declaration Form
<div align="center">
<img src=https://user-images.githubusercontent.com/40840292/206112148-82e26dad-4a77-40e3-bc11-f877047aeb87.png height=700 width=450 hspace='10'/>
</div>
```python
>>> from pprint import pprint
>>> from paddlenlp import Taskflow
>>> schema = ["收发货人", "进口口岸", "进口日期", "运输方式", "征免性质", "境内目的地", "运输工具名称", "包装种类", "件数", "合同协议号"]
>>> ie = Taskflow("information_extraction", schema=schema, model="uie-x-base")
>>> pprint(ie({"doc": "./cases/custom.jpeg"}))
[{'件数': [{'bbox': [[826, 1062, 926, 1121]],
'end': 312,
'probability': 0.9832498761402597,
'start': 308,
'text': '1142'}],
'包装种类': [{'bbox': [[1214, 1066, 1310, 1121]],
'end': 314,
'probability': 0.9995648138860567,
'start': 312,
'text': '纸箱'}],
'合同协议号': [{'bbox': [[151, 1077, 258, 1117]],
'end': 319,
'probability': 0.9984179437542124,
'start': 314,
'text': '33035'}],
'境内目的地': [{'bbox': [[1966, 872, 2095, 923]],
'end': 275,
'probability': 0.9975541483111243,
'start': 272,
'text': '上海市'}],
'征免性质': [{'bbox': [[1583, 770, 1756, 821]],
'end': 242,
'probability': 0.9950633161231508,
'start': 238,
'text': '一般征税'}],
'收发货人': [{'bbox': [[321, 533, 841, 580]],
'end': 95,
'probability': 0.4772132061042136,
'start': 82,
'text': '上海新尚实国际贸易有限公司'},
{'bbox': [[306, 584, 516, 624]],
'end': 150,
'probability': 0.33807074572195006,
'start': 140,
'text': '31222609K9'}],
'运输工具名称': [{'bbox': [[1306, 672, 1516, 712], [1549, 668, 1645, 712]],
'end': 190,
'probability': 0.6692050414718089,
'start': 174,
'text': 'E. R. TIANAN004E'}],
'运输方式': [{'bbox': [[1070, 664, 1240, 715]],
'end': 174,
'probability': 0.9994416347044179,
'start': 170,
'text': '永路运输'}],
'进口口岸': [{'bbox': [[1070, 566, 1346, 617]],
'end': 120,
'probability': 0.9945697196994345,
'start': 111,
'text': '洋山港区-2248'}],
'进口日期': [{'bbox': [[1726, 569, 1933, 610]],
'end': 130,
'probability': 0.9804819494073627,
'start': 120,
'text': '2017-02-24'}]}]
```
- Example: Driver's License
<div align="center">
<img src=https://user-images.githubusercontent.com/40840292/206114081-8c82e2a2-0c88-4ca3-9651-b12c94266be9.png height=400 width=700 hspace='10'/>
</div>
```python
>>> from pprint import pprint
>>> from paddlenlp import Taskflow
>>> schema = ["Name", "Date of birth", "Issue date"]
>>> ie = Taskflow("information_extraction", schema=schema, model="uie-x-base", ocr_lang="en", schema_lang="en")
>>> pprint(ie({"doc": "./cases/license.jpeg"}))
```
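Each call returns a list with one dict per input; every schema key maps to candidate spans carrying `text`, `start`/`end` offsets, a `probability` score and, for documents, `bbox` coordinates, as in the customs declaration output above. A small post-processing sketch that keeps only high-confidence spans (the helper and threshold are ours; it reuses the `ie` object created just above):
```python
def filter_spans(result, min_prob=0.9):
    # Drop candidate spans whose probability is below the threshold.
    return {
        label: [span for span in spans if span["probability"] >= min_prob]
        for label, spans in result.items()
    }

results = ie({"doc": "./cases/license.jpeg"})
print(filter_spans(results[0]))
```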
<a name="22"></a>
#### 2.2 Relation Extraction
Relation Extraction refers to identifying entities from text and extracting the semantic relationship between entities, and then obtaining triple information, namely <subject, predicate, object>.
- Example: Extracting relations from a table
<div align="center">
<img src=https://user-images.githubusercontent.com/40840292/206115688-30de315a-8fd4-4125-a3c3-8cb05c6e39e5.png height=180 width=600 hspace='10'/>
</div>
```python
>>> from pprint import pprint
>>> from paddlenlp import Taskflow
>>> schema = {"姓名": ["招聘单位", "报考岗位"]}
>>> ie = Taskflow("information_extraction", schema=schema, model="uie-x-base")
>>> pprint(ie({"doc": "./cases/table.png"}))
```
<a name="23"></a>
#### 2.3 Multi-Task Extraction
To extract entities and relations from a document simultaneously, construct the schema as follows:
```text
schema = [
"Total GBP",
"No.",
"Date",
"Customer No.",
"Subtotal without VAT",
{
"Description": [
"Quantity",
"Amount"
]
}
]
```
<div align="center">
<img src=https://user-images.githubusercontent.com/40840292/206120861-13b475dc-9a78-43bc-9dec-91f331db2ddf.png height=400 width=650 hspace='10'/>
</div>
```python
>>> from pprint import pprint
>>> from paddlenlp import Taskflow
>>> schema = ["Total GBP", "No.", "Date", "Customer No.", "Subtotal without VAT", {"Description": ["Quantity", "Amount"]}]
>>> ie = Taskflow("information_extraction", schema=schema, model="uie-x-base", ocr_lang="en", schema_lang="en")
>>> pprint(ie({"doc": "./cases/delivery_note.png"}))
```
<a name="24"></a>
#### 2.4 Input Format
For document extraction, UIE-X accepts image paths, HTTP image links and base64 strings, and supports both image and PDF document formats. In the input dict, `text` specifies text input and `doc` specifies document input.
```python
[
{'text': '2月8日上午北京冬奥会自由式滑雪女子大跳台决赛中中国选手谷爱凌以188.25分获得金牌!'},
{'doc': './cases/custom.jpg'},
{'doc': 'https://user-images.githubusercontent.com/40840292/203457719-84a70241-607e-4bb1-ab4c-3d9beee9e254.jpeg'}
]
```
**NOTE**: For multi-page PDF input, only the first page is currently extracted. UIE-X is best suited to single-page documents such as bills and receipts; it is not yet suitable for very long or multi-page documents.
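A batch call passes a list like the one above directly to the Taskflow instance, which returns one result dict per input, in order. A sketch, assuming `ie` is a Taskflow object created as in the examples above:
```python
inputs = [
    {'text': '2月8日上午北京冬奥会自由式滑雪女子大跳台决赛中中国选手谷爱凌以188.25分获得金牌!'},
    {'doc': './cases/custom.jpg'},
]
results = ie(inputs)          # list with one result dict per input
for inp, res in zip(inputs, results):
    print(list(res.keys()))   # schema keys found for this input
```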
- Using your own layout / OCR results as input
```python
layout = [
([68.0, 12.0, 167.0, 70.0], '名次'),
([464.0, 13.0, 559.0, 67.0], '球员'),
([833.0, 15.0, 1054.0, 64.0], '总出场时间'),
......
]
ie({"doc": doc_path, 'layout': layout})
```
<a name="25"></a>
#### 2.5 Tips
- Using the PP-Structure layout analysis function
Text recognized by OCR is sorted from top-left to bottom-right. For multi-column layouts or tables containing multiple lines of text, we recommend enabling layout analysis with ``layout_analysis=True`` to improve text ordering and extraction quality. The example below only illustrates when layout analysis is useful; real-world scenarios generally still require labeling and fine-tuning.
<div align="center">
<img src=https://user-images.githubusercontent.com/40840292/206139057-aedec98f-683c-4648-999d-81ce5ea04a86.png height=250 width=500 hspace='10'/>
</div>
```python
>>> from pprint import pprint
>>> from paddlenlp import Taskflow
>>> schema = "中标候选人名称"
>>> ie = Taskflow("information_extraction", schema=schema, model="uie-x-base", layout_analysis=True)
>>> pprint(ie({"doc": "https://gimg2.baidu.com/image_search/src=http%3A%2F%2Fwww.xuyiwater.com%2Fwp-content%2Fuploads%2F2021%2F06%2F1-4.jpg&refer=http%3A%2F%2Fwww.xuyiwater.com&app=2002&size=f9999,10000&q=a80&n=0&g=0n&fmt=auto?sec=1672994926&t=2a4a3fedf6999a34ccde190f97bcfa47"}))
```
<div align="center">
<img src=https://user-images.githubusercontent.com/40840292/206137978-3a69e7e2-dc2e-4d11-98b7-25911b0375a0.png height=350 width=600 hspace='10'/>
</div>
```python
>>> schema = "抗血小板药物的用药指征"
>>> ie.set_schema(schema)
>>> pprint(ie({"doc": "./cases/drug.webp"}))
```
<a name="26"></a>
#### 2.6 Visualization
- Visualization of OCR recognition results:
```python
>>> from paddlenlp.utils.doc_parser import DocParser
>>> doc_parser = DocParser(ocr_lang="en")
>>> doc_path = "./cases/business_card.png"
>>> parsed_doc = doc_parser.parse({"doc": doc_path})
>>> doc_parser.write_image_with_results(
doc_path,
layout=parsed_doc['layout'],
save_path="ocr_result.png")
```
<div align="center">
<img src=https://user-images.githubusercontent.com/40840292/206168103-0a37eab0-bb36-4eec-bd51-b3f85838b40c.png height=350 width=600 hspace='10'/>
</div>
- Visualization of extraction results:
```python
>>> from pprint import pprint
>>> from paddlenlp import Taskflow
>>> from paddlenlp.utils.doc_parser import DocParser
>>> doc_path = "./cases/business_card.png"
>>> schema = ["人名", "职位", "号码", "邮箱地址", "网址", "地址", "邮编"]
>>> ie = Taskflow("information_extraction", schema=schema, model="uie-x-base", ocr_lang="en")
>>> results = ie({"doc": doc_path})
>>> DocParser.write_image_with_results(
doc_path,
result=results[0],
save_path="image_show.png")
```
<div align="center">
<img src=https://user-images.githubusercontent.com/40840292/206168852-c32c34c4-f245-4116-a244-390e55c13383.png height=350 width=600 hspace='10'/>
</div>
<a name="27"></a>
#### 2.7 More Configuration
```python
>>> from paddlenlp import Taskflow
>>> ie = Taskflow('information_extraction',
schema="",
schema_lang="ch",
ocr_lang="ch",
batch_size=16,
model='uie-x-base',
layout_analysis=False,
position_prob=0.5,
precision='fp32',
use_fast=False)
```
* `schema`: Defines the extraction target; refer to the out-of-the-box examples of the different tasks for configuration.
* `schema_lang`: Sets the language of the schema. Defaults to `ch`; options are `ch` and `en`. Chinese and English schemas are constructed differently, so the schema language must be specified.
* `ocr_lang`: Selects the PaddleOCR language. `ch` handles mixed Chinese-English images; `en` works better on purely English images. Defaults to `ch`.
* `batch_size`: Batch size; adjust it according to your machine. Defaults to 16.
* `model`: Selects the model used by the task. Defaults to `uie-base`; options are `uie-base`, `uie-medium`, `uie-mini`, `uie-micro`, `uie-nano`, `uie-medical-base`, `uie-base-en` and `uie-x-base`.
* `layout_analysis`: Whether to use PP-Structure to analyze the document layout and improve the ordering of layout information. Defaults to False.
* `position_prob`: The probability for a span's start/end position lies between 0 and 1; returned results below this threshold are filtered out. Defaults to 0.5. The final span probability is the product of the start-position and end-position probabilities.
* `precision`: Selects the model precision. Defaults to `fp32`; options are `fp16` and `fp32`. `fp16` inference is faster and supports GPU and NPU hardware. If you choose `fp16` on GPU hardware, first make sure the NVIDIA drivers and base software are correctly installed, **with CUDA>=11.2 and cuDNN>=8.1.1**; on first use, follow the prompts to install the required dependencies. Also make sure the GPU's CUDA Compute Capability is above 7.0; typical devices include the V100, T4, A10, A100 and GTX 20/30 series cards. For more on CUDA Compute Capability and precision support, see the NVIDIA documentation: [GPU Hardware and Supported Precision Matrix](https://docs.nvidia.com/deeplearning/tensorrt/archives/tensorrt-840-ea/support-matrix/index.html#hardware-precision-matrix).
* `use_fast`: Uses FastTokenizer, a high-performance tokenization operator implemented in C++, to accelerate text preprocessing. Install it first with `pip install fast-tokenizer-python`. Defaults to `False`. See the [FastTokenizer documentation](../../fast_tokenizer) for more details.
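To make the `position_prob` semantics concrete: the score attached to each extracted span is the product of its start-position and end-position probabilities, and spans scoring below the threshold are dropped. A minimal sketch of that arithmetic, for illustration only:
```python
position_prob = 0.5  # default threshold

def span_score(start_prob: float, end_prob: float) -> float:
    # Final span probability = start probability * end probability.
    return start_prob * end_prob

print(span_score(0.9, 0.8))                   # 0.72 -> kept (>= 0.5)
print(span_score(0.6, 0.7) >= position_prob)  # 0.42 -> False, filtered out
```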
## References
- **[PaddleOCR](https://github.com/PaddlePaddle/PaddleOCR)**
- **[PP-Structure](https://github.com/PaddlePaddle/PaddleOCR/tree/release/2.6/ppstructure)**
# UIE Taskflow User Guide - Text Extraction Tasks
**Table of contents**
- [1. Introduction](#1)
- [2. Examples](#2)
- [3. Text Information Extraction](#3)
- [3.1 Entity Extraction](#31)
- [3.2 Relation Extraction](#32)
- [3.3 Event Extraction](#33)
- [3.4 Opinion Extraction](#34)
- [3.5 Sentiment Classification](#35)
- [3.6 Multi-task Extraction](#36)
- [3.7 Model Selection](#37)
- [3.8 More Configuration](#38)
<a name="1"></a>
## 1. Introduction
```paddlenlp.Taskflow``` provides general-purpose information extraction and opinion extraction for plain text. It can extract many types of information, including but not limited to named entities (person, place and organization names, etc.), relations (the director of a movie, the release time of a song, etc.), events (a car accident at an intersection, an earthquake somewhere, etc.), as well as evaluation aspects, opinion words and sentiment polarity. Users specify extraction targets in natural language and extract the corresponding information from the input text without any training, **providing an out-of-the-box solution for all kinds of information extraction needs**.
<a name="2"></a>
## 2. Examples
UIE is not restricted to particular industries or extraction targets. Below are some out-of-the-box industry examples implemented with Taskflow:
- Medical: structuring disease-specific records
![image](https://user-images.githubusercontent.com/40840292/169017581-93c8ee44-856d-4d17-970c-b6138d10f8bc.png)
- Legal: judgment document extraction
![image](https://user-images.githubusercontent.com/40840292/169017863-442c50f1-bfd4-47d0-8d95-8b1d53cfba3c.png)
- Finance: income certificate and prospectus extraction
![image](https://user-images.githubusercontent.com/40840292/169017982-e521ddf6-d233-41f3-974e-6f40f8f2edbc.png)
- Public security: accident report extraction
![image](https://user-images.githubusercontent.com/40840292/169018340-31efc1bf-f54d-43f7-b62a-8f7ce9bf0536.png)
- Tourism: brochure and handbook extraction
![image](https://user-images.githubusercontent.com/40840292/169018113-c937eb0b-9fd7-4ecc-8615-bcdde2dac81d.png)
<a name="3"></a>
## 3. Text Information Extraction
<a name="31"></a>
#### 3.1 Entity Extraction
Entity extraction, also known as Named Entity Recognition (NER), refers to identifying entities with specific meanings in text. In open-domain information extraction, the extracted categories are not limited and users can define them by themselves.
- For example, if the target entity types are "时间" (time), "选手" (player) and "赛事名称" (competition name), the schema is constructed as follows:
```text
['时间', '选手', '赛事名称']
```
Example:
```python
>>> from pprint import pprint
>>> from paddlenlp import Taskflow
>>> schema = ['时间', '选手', '赛事名称'] # Define the schema for entity extraction
>>> ie = Taskflow('information_extraction', schema=schema)
>>> pprint(ie("2月8日上午北京冬奥会自由式滑雪女子大跳台决赛中中国选手谷爱凌以188.25分获得金牌!")) # Better print results using pprint
[{'时间': [{'end': 6,
'probability': 0.9857378532924486,
'start': 0,
'text': '2月8日上午'}],
'赛事名称': [{'end': 23,
'probability': 0.8503089953268272,
'start': 6,
'text': '北京冬奥会自由式滑雪女子大跳台决赛'}],
'选手': [{'end': 31,
'probability': 0.8981548639781138,
'start': 28,
'text': '谷爱凌'}]}]
```
- For example, if the target entity types are "肿瘤的大小" (tumor size), "肿瘤的个数" (number of tumors), "肝癌级别" (liver cancer grade) and "脉管内癌栓分级" (vascular tumor thrombus grade), the schema is constructed as follows:
```text
['肿瘤的大小', '肿瘤的个数', '肝癌级别', '脉管内癌栓分级']
```
Since we already instantiated a `Taskflow` object in the example above, we can simply reset the extraction target with the `set_schema` method.
Example:
```python
>>> schema = ['肿瘤的大小', '肿瘤的个数', '肝癌级别', '脉管内癌栓分级']
>>> ie.set_schema(schema)
>>> pprint(ie("(右肝肿瘤)肝细胞性肝癌(II-III级,梁索型和假腺管型),肿瘤包膜不完整,紧邻肝被膜,侵及周围肝组织,未见脉管内癌栓(MVI分级:M0级)及卫星子灶形成。(肿物1个,大小4.2×4.0×2.8cm)。"))
[{'肝癌级别': [{'end': 20,
'probability': 0.9243267447402701,
'start': 13,
'text': 'II-III级'}],
'肿瘤的个数': [{'end': 84,
'probability': 0.7538413804059623,
'start': 82,
'text': '1个'}],
'肿瘤的大小': [{'end': 100,
'probability': 0.8341128043459491,
'start': 87,
'text': '4.2×4.0×2.8cm'}],
'脉管内癌栓分级': [{'end': 70,
'probability': 0.9083292325934664,
'start': 67,
'text': 'M0级'}]}]
```
- For example, if the target entity types are "person" and "organization", the schema is constructed as follows:
```text
['person', 'organization']
```
Example with the English model:
```python
>>> from pprint import pprint
>>> from paddlenlp import Taskflow
>>> schema = ['Person', 'Organization']
>>> ie_en = Taskflow('information_extraction', schema=schema, model='uie-base-en')
>>> pprint(ie_en('In 1997, Steve was excited to become the CEO of Apple.'))
[{'Organization': [{'end': 53,
'probability': 0.9985840259877357,
'start': 48,
'text': 'Apple'}],
'Person': [{'end': 14,
'probability': 0.999631971804547,
'start': 9,
'text': 'Steve'}]}]
```
<a name="32"></a>
#### 3.2 Relation Extraction
Relation Extraction (RE) refers to identifying entities in text and extracting the semantic relations between them, yielding triples of the form <subject, predicate, object>.
- For example, taking "竞赛名称" (competition name) as the extraction subject, with relation types "主办方" (organizer), "承办方" (undertaker) and "已举办次数" (number of editions held), the schema is constructed as follows:
```text
{
'竞赛名称': [
'主办方',
'承办方',
'已举办次数'
]
}
```
Example:
```python
>>> schema = {'竞赛名称': ['主办方', '承办方', '已举办次数']} # Define the schema for relation extraction
>>> ie.set_schema(schema) # Reset schema
>>> pprint(ie('2022语言与智能技术竞赛由中国中文信息学会和中国计算机学会联合主办,百度公司、中国中文信息学会评测工作委员会和中国计算机学会自然语言处理专委会承办,已连续举办4届,成为全球最热门的中文NLP赛事之一。'))
[{'竞赛名称': [{'end': 13,
'probability': 0.7825402622754041,
'relations': {'主办方': [{'end': 22,
'probability': 0.8421710521379353,
'start': 14,
'text': '中国中文信息学会'},
{'end': 30,
'probability': 0.7580801847701935,
'start': 23,
'text': '中国计算机学会'}],
'已举办次数': [{'end': 82,
'probability': 0.4671295049136148,
'start': 80,
'text': '4届'}],
'承办方': [{'end': 39,
'probability': 0.8292706618236352,
'start': 35,
'text': '百度公司'},
{'end': 72,
'probability': 0.6193477885474685,
'start': 56,
'text': '中国计算机学会自然语言处理专委会'},
{'end': 55,
'probability': 0.7000497331473241,
'start': 40,
'text': '中国中文信息学会评测工作委员会'}]},
'start': 0,
'text': '2022语言与智能技术竞赛'}]}]
```
- For example, taking "person" as the extraction subject, with relation types "Company" and "Position", the schema is constructed as follows:
```text
{
'Person': [
'Company',
'Position'
]
}
```
Example with the English model:
```python
>>> schema = [{'Person': ['Company', 'Position']}]
>>> ie_en.set_schema(schema)
>>> pprint(ie_en('In 1997, Steve was excited to become the CEO of Apple.'))
[{'Person': [{'end': 14,
'probability': 0.999631971804547,
'relations': {'Company': [{'end': 53,
'probability': 0.9960158209451642,
'start': 48,
'text': 'Apple'}],
'Position': [{'end': 44,
'probability': 0.8871063806420736,
'start': 41,
'text': 'CEO'}]},
'start': 9,
'text': 'Steve'}]}]
```
<a name="33"></a>
#### 3.3 Event Extraction
Event Extraction (EE) refers to extracting predefined event triggers and event arguments from natural language text and combining them into structured event information.
- For example, to extract the "地震强度" (magnitude), "时间" (time), "震中位置" (epicenter) and "震源深度" (focal depth) of a "地震" (earthquake) event, the schema is constructed as follows:
```text
{
'地震触发词': [
'地震强度',
'时间',
'震中位置',
'震源深度'
]
}
```
The trigger key always takes the form `触发词` or `XX触发词`, where `XX` is the concrete event type. In the example above the event type is `地震` (earthquake), so the corresponding trigger key is `地震触发词`.
Example:
```python
>>> schema = {'地震触发词': ['地震强度', '时间', '震中位置', '震源深度']} # Define the schema for event extraction
>>> ie.set_schema(schema) # Reset schema
>>> ie('中国地震台网正式测定:5月16日06时08分在云南临沧市凤庆县(北纬24.34度,东经99.98度)发生3.5级地震,震源深度10千米。')
[{'地震触发词': [{'text': '地震', 'start': 56, 'end': 58, 'probability': 0.9987181623528585, 'relations': {'地震强度': [{'text': '3.5级', 'start': 52, 'end': 56, 'probability': 0.9962985320905915}], '时间': [{'text': '5月16日06时08分', 'start': 11, 'end': 22, 'probability': 0.9882578028575182}], '震中位置': [{'text': '云南临沧市凤庆县(北纬24.34度,东经99.98度)', 'start': 23, 'end': 50, 'probability': 0.8551415716584501}], '震源深度': [{'text': '10千米', 'start': 63, 'end': 67, 'probability': 0.999158304648045}]}}]}]
```
- The English model **does not yet support event extraction**; if needed, it can be customized with an English event dataset.
<a name="34"></a>
#### 3.4 Opinion Extraction
Opinion extraction refers to extracting the evaluation aspects and opinion words contained in text.
- For example, to extract the evaluation aspects contained in the text together with their corresponding opinion words and sentiment polarity, the schema is constructed as follows:
```text
{
'评价维度': [
'观点词',
'情感倾向[正向,负向]'
]
}
```
Example:
```python
>>> schema = {'评价维度': ['观点词', '情感倾向[正向,负向]']} # Define the schema for opinion extraction
>>> ie.set_schema(schema) # Reset schema
>>> pprint(ie("店面干净,很清静,服务员服务热情,性价比很高,发现收银台有排队")) # Better print results using pprint
[{'评价维度': [{'end': 20,
'probability': 0.9817040258681473,
'relations': {'情感倾向[正向,负向]': [{'probability': 0.9966142505350533,
'text': '正向'}],
'观点词': [{'end': 22,
'probability': 0.957396472711558,
'start': 21,
'text': '高'}]},
'start': 17,
'text': '性价比'},
{'end': 2,
'probability': 0.9696849569741168,
'relations': {'情感倾向[正向,负向]': [{'probability': 0.9982153274927796,
'text': '正向'}],
'观点词': [{'end': 4,
'probability': 0.9945318044652538,
'start': 2,
'text': '干净'}]},
'start': 0,
'text': '店面'}]}]
```
- For the English model, the schema is constructed as follows:
```text
{
'Aspect': [
'Opinion',
'Sentiment classification [negative, positive]'
]
}
```
Example:
```python
>>> schema = [{'Aspect': ['Opinion', 'Sentiment classification [negative, positive]']}]
>>> ie_en.set_schema(schema)
>>> pprint(ie_en("The teacher is very nice."))
[{'Aspect': [{'end': 11,
'probability': 0.4301476415932193,
'relations': {'Opinion': [{'end': 24,
'probability': 0.9072940447883724,
'start': 15,
'text': 'very nice'}],
'Sentiment classification [negative, positive]': [{'probability': 0.9998571920670685,
'text': 'positive'}]},
'start': 4,
'text': 'teacher'}]}]
```
<a name="35"></a>
#### 3.5 Sentiment Classification
- Sentence-level sentiment classification determines whether the sentiment of a sentence is "正向" (positive) or "负向" (negative). The schema is constructed as follows:
```text
'情感倾向[正向,负向]'
```
Example:
```python
>>> schema = '情感倾向[正向,负向]' # Define the schema for sentence-level sentiment classification
>>> ie.set_schema(schema) # Reset schema
>>> ie('这个产品用起来真的很流畅,我非常喜欢')
[{'情感倾向[正向,负向]': [{'text': '正向', 'probability': 0.9988661643929895}]}]
```
For the English model, the schema is constructed as follows:
```text
'Sentiment classification [negative, positive]'
```
Example with the English model:
```python
>>> schema = 'Sentiment classification [negative, positive]'
>>> ie_en.set_schema(schema)
>>> ie_en('I am sorry but this is the worst film I have ever seen in my life.')
[{'Sentiment classification [negative, positive]': [{'text': 'negative', 'probability': 0.9998415771287057}]}]
```
<a name="36"></a>
#### 3.6 Multi-task Extraction
- For example, to perform entity extraction and relation extraction on legal text at the same time, the schema can be constructed as follows:
```text
[
"法院",
{
"原告": "委托代理人"
},
{
"被告": "委托代理人"
}
]
```
Example:
```python
>>> schema = ['法院', {'原告': '委托代理人'}, {'被告': '委托代理人'}]
>>> ie.set_schema(schema)
>>> pprint(ie("北京市海淀区人民法院\n民事判决书\n(199x)建初字第xxx号\n原告:张三。\n委托代理人李四,北京市 A律师事务所律师。\n被告:B公司,法定代表人王五,开发公司总经理。\n委托代理人赵六,北京市 C律师事务所律师。")) # Better print results using pprint
[{'原告': [{'end': 37,
'probability': 0.9949814024296764,
'relations': {'委托代理人': [{'end': 46,
'probability': 0.7956844697990384,
'start': 44,
'text': '李四'}]},
'start': 35,
'text': '张三'}],
'法院': [{'end': 10,
'probability': 0.9221074192336651,
'start': 0,
'text': '北京市海淀区人民法院'}],
'被告': [{'end': 67,
'probability': 0.8437349536631089,
'relations': {'委托代理人': [{'end': 92,
'probability': 0.7267121388225029,
'start': 90,
'text': '赵六'}]},
'start': 64,
'text': 'B公司'}]}]
```
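Nested relation results like the one above can be flattened into plain triples by walking the `relations` key of each span. A minimal sketch (the helper is ours, and the abbreviated input text below is illustrative; it reuses the `ie` object with the legal schema set above):
```python
def iter_triples(result):
    # Walk top-level spans and their nested 'relations' entries,
    # yielding (subject_text, relation_name, object_text) triples.
    for label, spans in result.items():
        for span in spans:
            for rel_name, objects in span.get("relations", {}).items():
                for obj in objects:
                    yield span["text"], rel_name, obj["text"]

results = ie("原告:张三。\n委托代理人李四。\n被告:B公司。\n委托代理人赵六。")  # abbreviated input
for triple in iter_triples(results[0]):
    print(triple)  # e.g. ('张三', '委托代理人', '李四')
```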
<a name="37"></a>
#### 3.7 Model Selection
- Multiple models are available to meet different accuracy and speed requirements
| Model | Architecture | Language |
| :---: | :--------: | :--------: |
| `uie-base` (default) | 12-layers, 768-hidden, 12-heads | Chinese |
| `uie-base-en` | 12-layers, 768-hidden, 12-heads | English |
| `uie-medical-base` | 12-layers, 768-hidden, 12-heads | Chinese |
| `uie-medium` | 6-layers, 768-hidden, 12-heads | Chinese |
| `uie-mini` | 6-layers, 384-hidden, 12-heads | Chinese |
| `uie-micro` | 4-layers, 384-hidden, 12-heads | Chinese |
| `uie-nano` | 4-layers, 312-hidden, 12-heads | Chinese |
| `uie-m-large` | 24-layers, 1024-hidden, 16-heads | Chinese and English |
| `uie-m-base` | 12-layers, 768-hidden, 12-heads | Chinese and English |
- Example with `uie-nano`:
```python
>>> from paddlenlp import Taskflow
>>> schema = ['时间', '选手', '赛事名称']
>>> ie = Taskflow('information_extraction', schema=schema, model="uie-nano")
>>> ie("2月8日上午北京冬奥会自由式滑雪女子大跳台决赛中中国选手谷爱凌以188.25分获得金牌!")
[{'时间': [{'text': '2月8日上午', 'start': 0, 'end': 6, 'probability': 0.6513581678349247}], '选手': [{'text': '谷爱凌', 'start': 28, 'end': 31, 'probability': 0.9819330659468051}], '赛事名称': [{'text': '北京冬奥会自由式滑雪女子大跳台决赛', 'start': 6, 'end': 23, 'probability': 0.4908131110420939}]}]
```
- `uie-m-base` and `uie-m-large` support mixed Chinese and English extraction. Example:
```python
>>> from pprint import pprint
>>> from paddlenlp import Taskflow
>>> schema = ['Time', 'Player', 'Competition', 'Score']
>>> ie = Taskflow('information_extraction', schema=schema, model="uie-m-base", schema_lang="en")
>>> pprint(ie(["2月8日上午北京冬奥会自由式滑雪女子大跳台决赛中中国选手谷爱凌以188.25分获得金牌!", "Rafael Nadal wins French Open Final!"]))
[{'Competition': [{'end': 23,
'probability': 0.9373889907291257,
'start': 6,
'text': '北京冬奥会自由式滑雪女子大跳台决赛'}],
'Player': [{'end': 31,
'probability': 0.6981119555336441,
'start': 28,
'text': '谷爱凌'}],
'Score': [{'end': 39,
'probability': 0.9888507878270296,
'start': 32,
'text': '188.25分'}],
'Time': [{'end': 6,
'probability': 0.9784080036931151,
'start': 0,
'text': '2月8日上午'}]},
{'Competition': [{'end': 35,
'probability': 0.9851549932171295,
'start': 18,
'text': 'French Open Final'}],
'Player': [{'end': 12,
'probability': 0.9379371275888104,
'start': 0,
'text': 'Rafael Nadal'}]}]
```
<a name="38"></a>
#### 3.8 More Configuration
```python
>>> from paddlenlp import Taskflow
>>> ie = Taskflow('information_extraction',
schema="",
schema_lang="ch",
batch_size=16,
model='uie-base',
position_prob=0.5,
precision='fp32',
use_fast=False)
```
* `schema`: Defines the extraction target; refer to the out-of-the-box examples of the different tasks for configuration.
* `schema_lang`: Sets the language of the schema. Defaults to `ch`; options are `ch` and `en`. Chinese and English schemas are constructed differently, so the schema language must be specified. This parameter is only valid for the `uie-x-base`, `uie-m-base` and `uie-m-large` models.
* `batch_size`: Batch size; adjust it according to your machine. Defaults to 16.
* `model`: Selects the model used by the task. Defaults to `uie-base`; options are `uie-base`, `uie-medium`, `uie-mini`, `uie-micro`, `uie-nano`, `uie-medical-base`, `uie-base-en` and `uie-x-base`.
* `position_prob`: The probability for a span's start/end position lies between 0 and 1; returned results below this threshold are filtered out. Defaults to 0.5. The final span probability is the product of the start-position and end-position probabilities.
* `precision`: Selects the model precision. Defaults to `fp32`; options are `fp16` and `fp32`. `fp16` inference is faster and supports GPU and NPU hardware. If you choose `fp16` on GPU hardware, first make sure the NVIDIA drivers and base software are correctly installed, **with CUDA>=11.2 and cuDNN>=8.1.1**; on first use, follow the prompts to install the required dependencies. Also make sure the GPU's CUDA Compute Capability is above 7.0; typical devices include the V100, T4, A10, A100 and GTX 20/30 series cards. For more on CUDA Compute Capability and precision support, see the NVIDIA documentation: [GPU Hardware and Supported Precision Matrix](https://docs.nvidia.com/deeplearning/tensorrt/archives/tensorrt-840-ea/support-matrix/index.html#hardware-precision-matrix).
* `use_fast`: Uses FastTokenizer, a high-performance tokenization operator implemented in C++, to accelerate text preprocessing. Install it first with `pip install fast-tokenizer-python`. Defaults to `False`. See the [FastTokenizer documentation](../../fast_tokenizer) for more details.
# UIE Taskflow User Guide - Text Information Extraction
**Table of contents**
- [1. Introduction](#1)
- [2. Examples](#2)
- [3. Text Information Extraction](#3)
- [3.1 Entity Extraction](#31)
- [3.2 Relation Extraction](#32)
- [3.3 Event Extraction](#33)
- [3.4 Opinion Extraction](#34)
- [3.5 Sentiment Classification](#35)
- [3.6 Multi-task Extraction](#36)
- [3.7 Available Models](#37)
- [3.8 More Configuration](#38)
<a name="1"></a>
## 1. Introduction
```paddlenlp.Taskflow``` provides general-purpose information extraction and opinion extraction for plain text. It can extract many types of information, including but not limited to named entities (person, place and organization names, etc.), relations (the director of a movie, the release time of a song, etc.), events (a car accident at an intersection, an earthquake somewhere, etc.), as well as evaluation aspects, opinion words and sentiment polarity. Users specify extraction targets in natural language and extract the corresponding information from the input text without any training, **providing an out-of-the-box solution for all kinds of information extraction needs**.
<a name="2"></a>
## 2. Examples
UIE is not restricted to particular industries or extraction targets. Below are some out-of-the-box industry examples implemented with Taskflow:
- Medical: structuring disease-specific records
![image](https://user-images.githubusercontent.com/40840292/169017581-93c8ee44-856d-4d17-970c-b6138d10f8bc.png)
- Legal: judgment document extraction
![image](https://user-images.githubusercontent.com/40840292/169017863-442c50f1-bfd4-47d0-8d95-8b1d53cfba3c.png)
- Financial scenarios - proof of income, extraction of prospectus
![image](https://user-images.githubusercontent.com/40840292/169017982-e521ddf6-d233-41f3-974e-6f40f8f2edbc.png)
- Public security scene - accident report extraction
![image](https://user-images.githubusercontent.com/40840292/169018340-31efc1bf-f54d-43f7-b62a-8f7ce9bf0536.png)
- Tourism scene - brochure, manual extraction
![image](https://user-images.githubusercontent.com/40840292/169018113-c937eb0b-9fd7-4ecc-8615-bcdde2dac81d.png)
<a name="3"></a>
## 3. Text Information Extraction
<a name="31"></a>
#### 3.1 Entity Extraction
Entity extraction, also known as Named Entity Recognition (NER), refers to identifying entities with specific meanings in text. In open-domain information extraction, the extracted categories are not limited and users can define them themselves.
- For example, to extract the entity types "person" and "organization", the schema is defined as follows:
```text
['person', 'organization']
```
Example:
```python
>>> from pprint import pprint
>>> from paddlenlp import Taskflow
>>> schema = ['Person', 'Organization']
>>> ie_en = Taskflow('information_extraction', schema=schema, model='uie-base-en')
>>> pprint(ie_en('In 1997, Steve was excited to become the CEO of Apple.'))
[{'Organization': [{'end': 53,
'probability': 0.9985840259877357,
'start': 48,
'text': 'Apple'}],
'Person': [{'end': 14,
'probability': 0.999631971804547,
'start': 9,
'text': 'Steve'}]}]
```
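The `start`/`end` offsets index directly into the input string, which makes it easy to highlight or post-process the extracted spans. A quick check of this, assuming the output above:
```python
>>> text = 'In 1997, Steve was excited to become the CEO of Apple.'
>>> span = ie_en(text)[0]['Person'][0]
>>> text[span['start']:span['end']]
'Steve'
```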
<a name="32"></a>
#### 3.2 Relation Extraction
Relation Extraction refers to identifying entities in text and extracting the semantic relations between them, yielding triples of the form <subject, predicate, object>.
- For example, with "person" as the extraction subject and "Company" and "Position" as the relation types, the schema is structured as follows:
```text
{
'Person': [
'Company',
'Position'
]
}
```
Example:
```python
>>> schema = [{'Person': ['Company', 'Position']}]
>>> ie_en.set_schema(schema)
>>> pprint(ie_en('In 1997, Steve was excited to become the CEO of Apple.'))
[{'Person': [{'end': 14,
'probability': 0.999631971804547,
'relations': {'Company': [{'end': 53,
'probability': 0.9960158209451642,
'start': 48,
'text': 'Apple'}],
'Position': [{'end': 44,
'probability': 0.8871063806420736,
'start': 41,
'text': 'CEO'}]},
'start': 9,
'text': 'Steve'}]}]
```
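The nested result above can be flattened into `<subject, predicate, object>` triples with a few lines of post-processing. A minimal sketch that assumes only the result structure shown above (`to_triples` is a hypothetical helper, not part of the library; the order of triples may vary):
```python
>>> def to_triples(results):
...     """Flatten Taskflow relation results into (subject, predicate, object) triples."""
...     triples = []
...     for doc in results:
...         for subjects in doc.values():
...             for subject in subjects:
...                 for predicate, objects in subject.get('relations', {}).items():
...                     for obj in objects:
...                         triples.append((subject['text'], predicate, obj['text']))
...     return triples
>>> to_triples(ie_en('In 1997, Steve was excited to become the CEO of Apple.'))
[('Steve', 'Company', 'Apple'), ('Steve', 'Position', 'CEO')]
```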
<a name="33"></a>
#### 3.3 Event Extraction
Event Extraction refers to extracting predefined event triggers and event arguments from natural language text and combining them into structured event information.
- The English model **does not support event extraction**; if needed, you can train a custom model on an English event dataset.
<a name="34"></a>
#### 3.4 Opinion Extraction
Opinion extraction refers to extracting the evaluation aspects and opinion words contained in the text.
- For example, the extraction targets are the evaluation aspects in the text together with their corresponding opinion words and sentiment polarity. The schema is structured as follows:
```text
{
'Aspect': [
'Opinion',
'Sentiment classification [negative, positive]'
]
}
```
Example:
```python
>>> schema = [{'Aspect': ['Opinion', 'Sentiment classification [negative, positive]']}]
>>> ie_en.set_schema(schema)
>>> pprint(ie_en("The teacher is very nice."))
[{'Aspect': [{'end': 11,
'probability': 0.4301476415932193,
'relations': {'Opinion': [{'end': 24,
'probability': 0.9072940447883724,
'start': 15,
'text': 'very nice'}],
'Sentiment classification [negative, positive]': [{'probability': 0.9998571920670685,
'text': 'positive'}]},
'start': 4,
'text': 'teacher'}]}]
```
<a name="35"></a>
#### 3.5 Sentiment Classification
- Sentence-level sentiment classification, i.e. classifying the sentiment polarity of a sentence as "positive" or "negative". The schema is structured as follows:
```text
'Sentiment classification [negative, positive]'
```
Example:
```python
>>> schema = 'Sentiment classification [negative, positive]'
>>> ie_en.set_schema(schema)
>>> ie_en('I am sorry but this is the worst film I have ever seen in my life.')
[{'Sentiment classification [negative, positive]': [{'text': 'negative', 'probability': 0.9998415771287057}]}]
```
<a name="36"></a>
#### 3.6 Multi-Task Extraction
- For example, in a legal scenario, entity extraction and relation extraction can be performed on the same text at once, with a schema constructed as follows:
```text
[
"法院",
{
"原告": "委托代理人"
},
{
"被告": "委托代理人"
}
]
```
Example:
```python
>>> schema = ['法院', {'原告': '委托代理人'}, {'被告': '委托代理人'}]
>>> ie.set_schema(schema)
>>> pprint(ie("北京市海淀区人民法院\n民事判决书\n(199x)建初字第xxx号\n原告:张三。\n委托代理人李四,北京市 A律师事务所律师。\n被告:B公司,法定代表人王五,开发公司总经理。\n委托代理人赵六,北京市 C律师事务所律师。")) # Better print results using pprint
[{'原告': [{'end': 37,
'probability': 0.9949814024296764,
'relations': {'委托代理人': [{'end': 46,
'probability': 0.7956844697990384,
'start': 44,
'text': '李四'}]},
'start': 35,
'text': '张三'}],
'法院': [{'end': 10,
'probability': 0.9221074192336651,
'start': 0,
'text': '北京市海淀区人民法院'}],
'被告': [{'end': 67,
'probability': 0.8437349536631089,
'relations': {'委托代理人': [{'end': 92,
'probability': 0.7267121388225029,
'start': 90,
'text': '赵六'}]},
'start': 64,
'text': 'B公司'}]}]
```
<a name="37"></a>
#### 3.7 Available Models
- A variety of models to meet different accuracy and speed requirements
| Model | Structure | Language |
| :---: | :--------: | :--------: |
| `uie-base` (default)| 12-layers, 768-hidden, 12-heads | Chinese |
| `uie-base-en` | 12-layers, 768-hidden, 12-heads | English |
| `uie-medical-base` | 12-layers, 768-hidden, 12-heads | Chinese |
| `uie-medium`| 6-layers, 768-hidden, 12-heads | Chinese |
| `uie-mini`| 6-layers, 384-hidden, 12-heads | Chinese |
| `uie-micro`| 4-layers, 384-hidden, 12-heads | Chinese |
| `uie-nano`| 4-layers, 312-hidden, 12-heads | Chinese |
| `uie-m-large`| 24-layers, 1024-hidden, 16-heads | Chinese and English |
| `uie-m-base`| 12-layers, 768-hidden, 12-heads | Chinese and English |
- Example call with `uie-nano`:
```python
>>> from paddlenlp import Taskflow
>>> schema = ['时间', '选手', '赛事名称']
>>> ie = Taskflow('information_extraction', schema=schema, model="uie-nano")
>>> ie("2月8日上午北京冬奥会自由式滑雪女子大跳台决赛中中国选手谷爱凌以188.25分获得金牌!")
[{'时间': [{'text': '2月8日上午', 'start': 0, 'end': 6, 'probability': 0.6513581678349247}], '选手': [{'text': '谷爱凌', 'start': 28, 'end': 31, 'probability': 0.9819330659468051}], '赛事名称': [{'text': '北京冬奥会自由式滑雪女子大跳台决赛', 'start': 6, 'end': 23, 'probability': 0.4908131110420939}]}]
```
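The smaller models trade some accuracy for speed, as the lower probabilities above suggest. If latency matters, a rough way to compare model sizes on your own hardware is a sketch like the following (timings depend entirely on your machine, so any numbers printed are illustrative only):
```python
>>> import time
>>> from paddlenlp import Taskflow
>>> text = '2月8日上午北京冬奥会自由式滑雪女子大跳台决赛中中国选手谷爱凌以188.25分获得金牌!'
>>> for name in ('uie-base', 'uie-nano'):
...     ie = Taskflow('information_extraction', schema=['时间', '选手', '赛事名称'], model=name)
...     ie(text)                       # warm-up call (includes model loading)
...     start = time.time()
...     ie(text)
...     print(name, f'{time.time() - start:.3f}s')
```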
- `uie-m-base` and `uie-m-large` support extraction of both Chinese and English. Example call:
```python
>>> from pprint import pprint
>>> from paddlenlp import Taskflow
>>> schema = ['Time', 'Player', 'Competition', 'Score']
>>> ie = Taskflow('information_extraction', schema=schema, model="uie-m-base", schema_lang="en")
>>> pprint(ie(["2月8日上午北京冬奥会自由式滑雪女子大跳台决赛中中国选手谷爱凌以188.25分获得金牌!", "Rafael Nadal wins French Open Final!"]))
[{'Competition': [{'end': 23,
'probability': 0.9373889907291257,
'start': 6,
'text': '北京冬奥会自由式滑雪女子大跳台决赛'}],
'Player': [{'end': 31,
'probability': 0.6981119555336441,
'start': 28,
'text': '谷爱凌'}],
'Score': [{'end': 39,
'probability': 0.9888507878270296,
'start': 32,
'text': '188.25分'}],
'Time': [{'end': 6,
'probability': 0.9784080036931151,
'start': 0,
'text': '2月8日上午'}]},
{'Competition': [{'end': 35,
'probability': 0.9851549932171295,
'start': 18,
'text': 'French Open Final'}],
'Player': [{'end': 12,
'probability': 0.9379371275888104,
'start': 0,
'text': 'Rafael Nadal'}]}]
```
<a name="38"></a>
#### 3.8 More Configuration
```python
>>> from paddlenlp import Taskflow
>>> ie = Taskflow('information_extraction',
schema="",
schema_lang="ch",
batch_size=16,
model='uie-base',
position_prob=0.5,
precision='fp32',
use_fast=False)
```
* `schema`: Defines the extraction targets; see the out-of-the-box examples of the different tasks above for how to configure it.
* `schema_lang`: Language of the schema, `ch` by default; options are `ch` and `en`. Chinese and English schemas are constructed differently, so the schema language must be specified. This parameter is only valid for the `uie-x-base`, `uie-m-base` and `uie-m-large` models.
* `batch_size`: Batch size; adjust it to your machine. Defaults to 16.
* `model`: The model used by the task, `uie-base` by default; options are `uie-base`, `uie-medium`, `uie-mini`, `uie-micro`, `uie-nano`, `uie-medical-base`, `uie-base-en` and `uie-x-base`.
* `position_prob`: The model's probability for the start/end position of a span lies between 0 and 1, and spans scoring below this threshold are dropped from the returned results. Defaults to 0.5. The final probability reported for a span is the product of its start position probability and end position probability (see the sketch after this list).
* `precision`: Model precision, `fp32` by default; options are `fp16` and `fp32`. `fp16` inference is faster and is supported on GPU and NPU hardware. If you choose `fp16` on GPU hardware, first make sure the NVIDIA drivers and base software are correctly installed, with **CUDA>=11.2 and cuDNN>=8.1.1**; on first use, follow the prompts to install the required dependencies. Also make sure the GPU's CUDA Compute Capability is greater than 7.0; typical devices include V100, T4, A10, A100, and GTX 20/30 series cards. For more about CUDA Compute Capability and precision support, see the NVIDIA documentation: [GPU Hardware and Supported Precision Matrix](https://docs.nvidia.com/deeplearning/tensorrt/archives/tensorrt-840-ea/support-matrix/index.html#hardware-precision-matrix).
* `use_fast`: Use FastTokenizer, a high-performance tokenization operator implemented in C++, to speed up text preprocessing. Requires installing the FastTokenizer library first via `pip install fast-tokenizer-python`. Defaults to `False`. See the [FastTokenizer documentation](../../fast_tokenizer) for more usage details.
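In practice, `position_prob` is the main recall/precision knob: lowering the threshold keeps lower-confidence spans (higher recall), while raising it keeps only high-confidence spans (higher precision). A configuration sketch using only the parameters documented above (the threshold values are illustrative, not recommendations):
```python
>>> from paddlenlp import Taskflow
>>> schema = ['Person', 'Organization']
>>> # Favor recall: keep spans the model is less confident about.
>>> ie_recall = Taskflow('information_extraction', schema=schema,
...                      model='uie-base-en', position_prob=0.3)
>>> # Favor precision: keep only high-confidence spans; fp16 speeds up GPU inference.
>>> ie_precision = Taskflow('information_extraction', schema=schema,
...                         model='uie-base-en', position_prob=0.8, precision='fp16')
```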
Simplified Chinese | [English](README_en.md)
# Text Information Extraction
**Table of contents**
- [1. Text Information Extraction Application](#1)
- [2. Quick Start](#2)
- [2.1 Code Structure](#代码结构)
- [2.2 Data Annotation](#数据标注)
- [2.3 Finetuning](#模型微调)
- [2.4 Evaluation](#模型评估)
- [2.5 Inference](#定制模型一键预测)
- [2.6 Experiments](#实验指标)
- [2.7 Closed-Domain Distillation](#封闭域蒸馏)
<a name="1"></a>
## 1. Text Information Extraction Application
This project provides an end-to-end application solution for plain-text extraction based on UIE fine-tuning, covering the **full workflow of data labeling, model training, model tuning and inference deployment**, so that information extraction products can be put into production quickly.
Plainly speaking, information extraction is the process of extracting structured information from input data such as text or images. Putting information extraction into production typically faces many challenges, such as varied domains, diverse tasks and scarce data. Targeting these pain points, the PaddleNLP information extraction application applies UIE's unified modeling idea and provides an industry-grade solution that supports **extracting entities, relations, events, opinions and other information from documents/images/tables and plain text**. The application **is not limited to any industry domain or extraction target**, enabling a seamless path from product prototyping and business POC to deployment and iteration, and helping developers quickly adapt extraction to specific domains.
**Highlights of the text information extraction application:**
- **Comprehensive Coverage🎓:** Covers the mainstream tasks of text information extraction and supports multiple languages, meeting developers' diverse extraction needs.
- **State-of-the-Art Performance🏃:** Built on the UIE model series, which performs strongly on plain text, with pretrained models in several sizes to meet different needs and broad, mature practical applicability.
- **Easy to Use⚡:** Three lines of code with Taskflow enable quick extraction without labeled data; one command starts training, and deployment is straightforward, lowering the barrier to applying information extraction.
- **Efficient Tuning✊:** Developers can easily get started with the data labeling and model training process without a Machine Learning background.
<a name="2"></a>
## 2. Quick Start
For simple extraction targets, you can use ```paddlenlp.Taskflow``` directly for zero-shot extraction. For specialized scenarios, we recommend the customization workflow (labeling a small amount of data for model fine-tuning) to further improve performance.
<a name="代码结构"></a>
### 2.1 Code Structure
```shell
.
├── utils.py # data processing tools
├── finetune.py # model fine-tuning, compression script
├── evaluate.py # model evaluation script
└── README.md
```
<a name="数据标注"></a>
### 2.2 Data Annotation
We recommend using [Label Studio](https://labelstud.io/) for text information extraction data labeling. This project provides an end-to-end pipeline from labeling to training: data exported from Label Studio can be converted into the input format required by the model through the [label_studio.py](../label_studio.py) script. For a detailed introduction to labeling methods, please refer to the [Label Studio Data Labeling Guide](../label_studio_text.md).
Here we provide the pre-labeled `military relation extraction dataset`. You can download it with the command below; we will show how to use the data conversion script to generate training/validation/test set files and fine-tune the UIE model.
Download the military relation extraction dataset:
```shell
wget https://bj.bcebos.com/paddlenlp/datasets/military.tar.gz
tar -xvf military.tar.gz
mv military data
rm military.tar.gz
```
Generate training/validation set files:
```shell
python ../label_studio.py \
--label_studio_file ./data/label_studio.json \
--save_dir ./data \
--splits 0.76 0.24 0 \
--negative_ratio 3 \
--task_type ext
```
For labeling rules and parameter descriptions of more task types (including entity extraction, relation extraction, document classification, etc.), please refer to the [Label Studio Data Labeling Guide](../label_studio_text.md).
<a name="模型微调"></a>
### 2.3 Finetuning
We recommend fine-tuning with the [Trainer API](https://github.com/PaddlePaddle/PaddleNLP/blob/develop/docs/trainer.md). You only need to pass in the model and datasets to run pre-training, fine-tuning and model compression efficiently; it provides one-command multi-GPU training, mixed-precision training, gradient accumulation, checkpoint resuming and logging, and also wraps common training configurations such as optimizers and learning-rate schedules.
Use the following command to fine-tune with `uie-base` as the pre-trained model and save the fine-tuned model to `$finetuned_model`:
Single GPU:
```shell
python finetune.py \
--device gpu \
--logging_steps 10 \
--save_steps 100 \
--eval_steps 100 \
--seed 1000 \
--model_name_or_path uie-base \
--output_dir ./checkpoint/model_best \
--train_path data/train.txt \
--dev_path data/dev.txt \
--max_seq_len 512 \
--per_device_train_batch_size 16 \
--per_device_eval_batch_size 16 \
--num_train_epochs 20 \
--learning_rate 1e-5 \
--do_train \
--do_eval \
--do_export \
--export_model_dir ./checkpoint/model_best \
--overwrite_output_dir \
--disable_tqdm True \
--metric_for_best_model eval_f1 \
--load_best_model_at_end True \
--save_total_limit 1
```
In a GPU environment, you can specify the `--gpus` parameter for multi-GPU training:
```shell
python -u -m paddle.distributed.launch --gpus "0,1" finetune.py \
--device gpu \
--logging_steps 10 \
--save_steps 100 \
--eval_steps 100 \
--seed 1000 \
--model_name_or_path uie-base \
--output_dir ./checkpoint/model_best \
--train_path data/train.txt \
--dev_path data/dev.txt \
--max_seq_len 512 \
--per_device_train_batch_size 8 \
--per_device_eval_batch_size 8 \
--num_train_epochs 20 \
--learning_rate 1e-5 \
--do_train \
--do_eval \
--do_export \
--export_model_dir ./checkpoint/model_best \
--overwrite_output_dir \
--disable_tqdm True \
--metric_for_best_model eval_f1 \
--load_best_model_at_end True \
--save_total_limit 1
```
Because the `--do_eval` parameter is set in this example, evaluation runs automatically after training finishes.
Parameters:
* `device`: Training device; one of 'cpu', 'gpu' and 'npu'. Defaults to GPU training.
* `logging_steps`: Interval in steps between log prints during training; defaults to 10.
* `save_steps`: Interval in steps between model checkpoints during training; defaults to 100.
* `eval_steps`: Interval in steps between evaluations during training; defaults to 100.
* `seed`: Global random seed; defaults to 42.
* `model_name_or_path`: Pre-trained model used for few-shot training. Defaults to "uie-x-base".
* `output_dir`: Required; directory where the model is saved after training or compression. Defaults to `None`.
* `train_path`: Training set path; defaults to `None`.
* `dev_path`: Development set path; defaults to `None`.
* `max_seq_len`: Maximum text length; longer inputs are split automatically. Defaults to 512.
* `per_device_train_batch_size`: Training batch size per GPU core/CPU; defaults to 8.
* `per_device_eval_batch_size`: Evaluation batch size per GPU core/CPU; defaults to 8.
* `num_train_epochs`: Number of training epochs; 100 is a reasonable choice with early stopping. Defaults to 10.
* `learning_rate`: Maximum learning rate; 1e-5 is recommended for UIE-X. Defaults to 3e-5.
* `label_names`: Names of the training data labels; set to 'start_positions' 'end_positions' for UIE-X. Defaults to None.
* `do_train`: Whether to fine-tune; pass this flag to run training. Not set by default.
* `do_eval`: Whether to evaluate; pass this flag to run evaluation. Not set by default.
* `do_export`: Whether to export the static graph; pass this flag to run the export. Not set by default.
* `export_model_dir`: Static graph export directory; defaults to None.
* `overwrite_output_dir`: If `True`, overwrite the contents of the output directory. If `output_dir` points to a checkpoint directory, training resumes from it.
* `disable_tqdm`: Whether to disable the tqdm progress bar.
* `metric_for_best_model`: Metric for selecting the best model; `eval_f1` is recommended for UIE-X. Defaults to None.
* `load_best_model_at_end`: Whether to load the best model at the end of training, usually used together with `metric_for_best_model`. Defaults to False.
* `save_total_limit`: If set, limits the total number of checkpoints, deleting older checkpoints from the output directory. Defaults to None.
<a name="模型评估"></a>
### 2.4 Evaluation
Run the following command to evaluate the model:
```shell
python evaluate.py \
--model_path ./checkpoint/model_best \
--test_path ./data/dev.txt \
--device gpu \
--batch_size 16 \
--max_seq_len 512
```
Run the following command to evaluate a UIE-M model:
```shell
python evaluate.py \
--model_path ./checkpoint/model_best \
--test_path ./data/dev.txt \
--batch_size 16 \
--device gpu \
--max_seq_len 512 \
--multilingual
```
Evaluation note: single-stage evaluation is used, i.e. for tasks that are predicted in stages, such as relation extraction and event extraction, the predictions of each stage are evaluated separately. By default, the validation/test set uses all labels at the same level to construct all negative examples.
You can enable `debug` mode to evaluate each positive category separately; this mode is only for model debugging:
```shell
python evaluate.py \
--model_path ./checkpoint/model_best \
--test_path ./data/dev.txt \
--debug
```
Example output:
```text
[2022-11-21 12:48:41,794] [ INFO] - -----------------------------
[2022-11-21 12:48:41,795] [ INFO] - Class Name: 武器名称
[2022-11-21 12:48:41,795] [ INFO] - Evaluation Precision: 0.96667 | Recall: 0.96667 | F1: 0.96667
[2022-11-21 12:48:44,093] [ INFO] - -----------------------------
[2022-11-21 12:48:44,094] [ INFO] - Class Name: X的产国
[2022-11-21 12:48:44,094] [ INFO] - Evaluation Precision: 1.00000 | Recall: 0.99275 | F1: 0.99636
[2022-11-21 12:48:46,474] [ INFO] - -----------------------------
[2022-11-21 12:48:46,475] [ INFO] - Class Name: X的研发单位
[2022-11-21 12:48:46,475] [ INFO] - Evaluation Precision: 0.77519 | Recall: 0.64935 | F1: 0.70671
[2022-11-21 12:48:48,800] [ INFO] - -----------------------------
[2022-11-21 12:48:48,801] [ INFO] - Class Name: X的类型
[2022-11-21 12:48:48,801] [ INFO] - Evaluation Precision: 1.00000 | Recall: 1.00000 | F1: 1.00000
```
Parameters:
- `device`: Evaluation device; one of 'cpu', 'gpu' and 'npu'. Defaults to GPU evaluation.
- `model_path`: Path of the model folder to evaluate; it must contain the model weight file `model_state.pdparams` and the configuration file `model_config.json`.
- `test_path`: Test set file used for evaluation.
- `batch_size`: Batch size; adjust it to your machine. Defaults to 16.
- `max_seq_len`: Maximum text length; longer inputs are split automatically. Defaults to 512.
- `debug`: Whether to enable debug mode, which evaluates each positive category separately. This mode is only for model debugging and is disabled by default.
- `multilingual`: Whether the model is multilingual; disabled by default.
- `schema_lang`: Language of the schema; options are `ch` and `en`. Defaults to `ch`; choose `en` for English datasets.
<a name="定制模型一键预测"></a>
### 2.5 Inference
Load the custom model with `paddlenlp.Taskflow` by specifying the path of the model weights via `task_path`; the directory must contain the trained model weight file `model_state.pdparams`.
```python
>>> from pprint import pprint
>>> from paddlenlp import Taskflow
>>> schema = {"武器名称": ["产国", "类型", "研发单位"]}
# Set the extraction target and the path to the custom model weights
>>> my_ie = Taskflow("information_extraction", schema=schema, task_path='./checkpoint/model_best')
>>> pprint(my_ie("威尔哥(Virgo)减速炸弹是由瑞典FFV军械公司专门为瑞典皇家空军的攻击机实施低空高速轰炸而研制,1956年开始研制,1963年进入服役,装备于A32“矛盾”、A35“龙”、和AJ134“雷”攻击机,主要用于攻击登陆艇、停放的飞机、高炮、野战火炮、轻型防护装甲车辆以及有生力量。"))
[{'武器名称': [{'end': 14,
'probability': 0.9998632702221926,
'relations': {'产国': [{'end': 18,
'probability': 0.9998815094394331,
'start': 16,
'text': '瑞典'}],
'研发单位': [{'end': 25,
'probability': 0.9995875123178521,
'start': 18,
'text': 'FFV军械公司'}],
'类型': [{'end': 14,
'probability': 0.999877336059086,
'start': 12,
'text': '炸弹'}]},
'start': 0,
'text': '威尔哥(Virgo)减速炸弹'}]}]
```
<a name="实验指标"></a>
### 2.6 Experiments
Results on the military relation extraction dataset:
| | Precision | Recall | F1 Score |
| :---: | :--------: | :--------: | :--------: |
| 0-shot | 0.64634| 0.53535 | 0.58564 |
| 5-shot | 0.89474 | 0.85000 | 0.87179 |
| 10-shot | 0.92793 | 0.85833 | 0.89177 |
| full-set | 0.93103 | 0.90000 | 0.91525 |
<a name="封闭域蒸馏"></a>
### 2.7 Closed-Domain Distillation
Some industrial application scenarios have strict inference performance requirements, and a model that cannot be effectively compressed cannot be put into production. We therefore built the UIE Slim data distillation system on top of data distillation techniques. The idea is to use data as a bridge to transfer the knowledge of the UIE model to a small closed-domain information extraction model, achieving a large inference speedup with little loss in accuracy. See [UIE Slim Data Distillation](./data_distill/README.md) for details.
# Text Information Extraction
**Table of contents**
- [1. Text Information Extraction Application](#1)
- [2. Quick Start](#2)
- [2.1 Code Structure](#21)
- [2.2 Data Annotation](#22)
- [2.3 Finetuning](#23)
- [2.4 Evaluation](#24)
- [2.5 Inference](#25)
- [2.6 Experiments](#26)
- [2.7 Closed Domain Distillation](#27)
<a name="1"></a>
## 1. Text Information Extraction Application
This project provides an end-to-end application solution for plain text extraction based on UIE fine-tuning and goes through the full lifecycle of **data labeling, model training and model deployment**. We hope this guide can help you apply Information Extraction techniques in your own products or models.
Information Extraction (IE) is the process of extracting structured information from given input data such as text, pictures or scanned document. While IE brings immense value, applying IE techniques is never easy with challenges such as domain adaptation, heterogeneous structures, lack of labeled data, etc. This PaddleNLP Information Extraction Guide builds on the foundation of our work in [Universal Information Extraction](https://arxiv.org/abs/2203.12277) and provides an industrial-level solution that not only supports **extracting entities, relations, events and opinions from plain text**, but also supports **cross-modal extraction out of documents, tables and pictures.** Our method features a flexible prompt, which allows you to specify extraction targets with simple natural language. We also provide a few different domain-adapted models specialized for different industry sectors.
**Highlights:**
- **Comprehensive Coverage🎓:** Covers various mainstream tasks of information extraction for plain text and document scenarios, supports multiple languages
- **State-of-the-Art Performance🏃:** Strong performance from the UIE model series models in plain text and multimodal datasets. We also provide pretrained models of various sizes to meet different needs
- **Easy to use⚡:** three lines of code to use our `Taskflow` for out-of-box Information Extraction capabilities. One line of command to model training and model deployment
- **Efficient Tuning✊:** Developers can easily get started with the data labeling and model training process without a background in Machine Learning.
<a name="2"></a>
## 2. Quick Start
For quick start, you can directly use ```paddlenlp.Taskflow``` out-of-the-box, leveraging the zero-shot performance. For production use cases, we recommend labeling a small amount of data for model fine-tuning to further improve the performance.
<a name="21"></a>
### 2.1 Code Structure
```shell
.
├── utils.py # data processing tools
├── finetune.py # model fine-tuning, compression script
├── evaluate.py # model evaluation script
└── README.md
```
<a name="22"></a>
### 2.2 Data Annotation
We recommend using [Label Studio](https://labelstud.io/) for data labeling. We provide an end-to-end pipeline for the labeling -> training process. Data labeled in Label Studio can be exported and converted into the required input format for the model through the [label_studio.py](../label_studio.py) script. For a detailed introduction to labeling methods, please refer to the [Label Studio Data Labeling Guide](../label_studio_text_en.md).
Here we provide the pre-labeled example dataset `Military Relationship Extraction Dataset`, which you can download with the following command. We will show how to use the data conversion script to generate training/validation/test set files for fine-tuning.
Download the military relationship extraction dataset:
```shell
wget https://bj.bcebos.com/paddlenlp/datasets/military.tar.gz
tar -xvf military.tar.gz
mv military data
rm military.tar.gz
```
Generate training/validation set files:
```shell
python ../label_studio.py \
--label_studio_file ./data/label_studio.json \
--save_dir ./data \
--splits 0.76 0.24 0 \
--negative_ratio 3 \
--task_type ext
```
For more labeling rules and parameter descriptions for different types of tasks (including entity extraction, relationship extraction, document classification, etc.), please refer to [Label Studio Data Labeling Guide](../label_studio_text_en.md).
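Before fine-tuning, it can be worth sanity-checking the converted files. A minimal inspection sketch, assuming the conversion script writes one JSON example per line (this snippet is illustrative and not part of the project scripts):
```python
>>> import json
>>> for split in ('train', 'dev'):
...     with open(f'data/{split}.txt', encoding='utf-8') as f:
...         examples = [json.loads(line) for line in f]
...     print(split, len(examples), 'examples; fields:', sorted(examples[0]))
```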
<a name="23"></a>
### 2.3 Finetuning
Use the following command to fine-tune the model using `uie-base` as the pre-trained model, and save the fine-tuned model to `$finetuned_model`:
Single GPU:
```shell
python finetune.py \
--device gpu \
--logging_steps 10 \
--save_steps 100 \
--eval_steps 100 \
--seed 1000 \
--model_name_or_path uie-base \
--output_dir ./checkpoint/model_best \
--train_path data/train.txt \
--dev_path data/dev.txt \
--max_seq_len 512 \
--per_device_train_batch_size 16 \
--per_device_eval_batch_size 16 \
--num_train_epochs 20 \
--learning_rate 1e-5 \
--do_train \
--do_eval \
--do_export \
--export_model_dir ./checkpoint/model_best \
--overwrite_output_dir \
--disable_tqdm True \
--metric_for_best_model eval_f1 \
--load_best_model_at_end True \
--save_total_limit 1
```
Multiple GPUs:
```shell
python -u -m paddle.distributed.launch --gpus "0,1" finetune.py \
--device gpu \
--logging_steps 10 \
--save_steps 100 \
--eval_steps 100 \
--seed 1000 \
--model_name_or_path uie-base \
--output_dir ./checkpoint/model_best \
--train_path data/train.txt \
--dev_path data/dev.txt \
--max_seq_len 512 \
--per_device_train_batch_size 8 \
--per_device_eval_batch_size 8 \
--num_train_epochs 20 \
--learning_rate 1e-5 \
--do_train \
--do_eval \
--do_export \
--export_model_dir ./checkpoint/model_best \
--overwrite_output_dir \
--disable_tqdm True \
--metric_for_best_model eval_f1 \
--load_best_model_at_end True \
--save_total_limit 1
```
Parameters:
* `device`: Training device; one of 'cpu', 'gpu' and 'npu'. Defaults to GPU training.
* `logging_steps`: Interval in steps between log prints during training; defaults to 10.
* `save_steps`: Interval in steps between model checkpoints during training; defaults to 100.
* `eval_steps`: Interval in steps between evaluations during training; defaults to 100.
* `seed`: Global random seed; defaults to 42.
* `model_name_or_path`: Pre-trained model used for few-shot training. Defaults to "uie-x-base".
* `output_dir`: Required; directory where the model is saved after training or compression. Defaults to `None`.
* `train_path`: Training set path; defaults to `None`.
* `dev_path`: Development set path; defaults to `None`.
* `max_seq_len`: Maximum text length; longer inputs are split automatically. Defaults to 512.
* `per_device_train_batch_size`: Training batch size per GPU core/NPU core/CPU; defaults to 8.
* `per_device_eval_batch_size`: Evaluation batch size per GPU core/NPU core/CPU; defaults to 8.
* `num_train_epochs`: Number of training epochs; 100 is a reasonable choice with early stopping. Defaults to 10.
* `learning_rate`: Maximum learning rate; 1e-5 is recommended for UIE-X. Defaults to 3e-5.
* `label_names`: Names of the training data labels; set to 'start_positions' 'end_positions' for UIE-X. Defaults to None.
* `do_train`: Whether to fine-tune; pass this flag to run training. Not set by default.
* `do_eval`: Whether to evaluate; pass this flag to run evaluation. Not set by default.
* `do_export`: Whether to export the static graph; pass this flag to run the export. Not set by default.
* `export_model_dir`: Static graph export directory; defaults to None.
* `overwrite_output_dir`: If `True`, overwrite the contents of the output directory. If `output_dir` points to a checkpoint directory, training resumes from it.
* `disable_tqdm`: Whether to disable the tqdm progress bar.
* `metric_for_best_model`: Metric for selecting the best model; `eval_f1` is recommended for UIE-X. Defaults to None.
* `load_best_model_at_end`: Whether to load the best model at the end of training, usually used together with `metric_for_best_model`. Defaults to False.
* `save_total_limit`: If set, limits the total number of checkpoints, deleting older checkpoints from the output directory. Defaults to None.
<a name="24"></a>
### 2.4 Evaluation
Model evaluation:
```shell
python evaluate.py \
--model_path ./checkpoint/model_best \
--test_path ./data/dev.txt \
--batch_size 16 \
--max_seq_len 512
```
Model evaluation for UIE-M:
```shell
python evaluate.py \
--model_path ./checkpoint/model_best \
--test_path ./data/dev.txt \
--batch_size 16 \
--max_seq_len 512 \
--multilingual
```
We adopt the single-stage method for evaluation, which means tasks that require multiple stages (e.g. relation extraction, event extraction) are evaluated separately for each stage. By default, the validation/test set uses all labels at the same level to construct the negative examples.
The `debug` mode can be turned on to evaluate each positive category separately. This mode is only used for model debugging:
```shell
python evaluate.py \
--model_path ./checkpoint/model_best \
--test_path ./data/dev.txt \
--debug
```
Output print example:
```text
[2022-11-21 12:48:41,794] [ INFO] - -----------------------------
[2022-11-21 12:48:41,795] [ INFO] - Class Name: 武器名称
[2022-11-21 12:48:41,795] [ INFO] - Evaluation Precision: 0.96667 | Recall: 0.96667 | F1: 0.96667
[2022-11-21 12:48:44,093] [ INFO] - -----------------------------
[2022-11-21 12:48:44,094] [ INFO] - Class Name: X的产国
[2022-11-21 12:48:44,094] [ INFO] - Evaluation Precision: 1.00000 | Recall: 0.99275 | F1: 0.99636
[2022-11-21 12:48:46,474] [ INFO] - -----------------------------
[2022-11-21 12:48:46,475] [ INFO] - Class Name: X的研发单位
[2022-11-21 12:48:46,475] [ INFO] - Evaluation Precision: 0.77519 | Recall: 0.64935 | F1: 0.70671
[2022-11-21 12:48:48,800] [ INFO] - -----------------------------
[2022-11-21 12:48:48,801] [ INFO] - Class Name: X的类型
[2022-11-21 12:48:48,801] [ INFO] - Evaluation Precision: 1.00000 | Recall: 1.00000 | F1: 1.00000
```
Parameters:
- `device`: Evaluation device; one of 'cpu', 'gpu' and 'npu'. Defaults to GPU evaluation.
- `model_path`: Path of the model folder to evaluate; it must contain the model weight file `model_state.pdparams` and the configuration file `model_config.json`.
- `test_path`: Test set file used for evaluation.
- `batch_size`: Batch size; adjust it to your machine. Defaults to 16.
- `max_seq_len`: Maximum text length; longer inputs are split automatically. Defaults to 512.
- `debug`: Whether to enable debug mode, which evaluates each positive category separately. This mode is only for model debugging and is disabled by default.
- `multilingual`: Whether the model is multilingual; disabled by default.
- `schema_lang`: Language of the schema; options are `ch` and `en`. Defaults to `ch`; choose `en` for English datasets.
<a name="25"></a>
### 2.5 Inference
As with the pretrained models, you can use `paddlenlp.Taskflow` to load your custom model by specifying the path of the model weights via `task_path`; the directory must contain the trained model weight file `model_state.pdparams`.
```python
>>> from pprint import pprint
>>> from paddlenlp import Taskflow
>>> schema = {"武器名称": ["产国", "类型", "研发单位"]}
# Set the extraction target and the fine-tuned model path
>>> my_ie = Taskflow("information_extraction", schema=schema, task_path='./checkpoint/model_best')
>>> pprint(my_ie("威尔哥(Virgo)减速炸弹是由瑞典FFV军械公司专门为瑞典皇家空军的攻击机实施低空高速轰炸而研制,1956年开始研制,1963年进入服役,装备于A32“矛盾”、A35“龙”、和AJ134“雷”攻击机,主要用于攻击登陆艇、停放的飞机、高炮、野战火炮、轻型防护装甲车辆以及有生力量。"))
[{'武器名称': [{'end': 14,
'probability': 0.9998632702221926,
'relations': {'产国': [{'end': 18,
'probability': 0.9998815094394331,
'start': 16,
'text': '瑞典'}],
'研发单位': [{'end': 25,
'probability': 0.9995875123178521,
'start': 18,
'text': 'FFV军械公司'}],
'类型': [{'end': 14,
'probability': 0.999877336059086,
'start': 12,
'text': '炸弹'}]},
'start': 0,
'text': '威尔哥(Virgo)减速炸弹'}]}]
```
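As in the multilingual example earlier, Taskflow also accepts a list of texts, so the custom model can be applied to a whole file in one call. A small sketch (the input path is hypothetical):
```python
>>> with open('data/unlabeled_data.txt', encoding='utf-8') as f:
...     texts = [line.strip() for line in f if line.strip()]
>>> results = my_ie(texts)  # one result dict per input text, in the input order
```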
<a name="26"></a>
### 2.6 Experiments
Results on the military relationship extraction dataset:
| | Precision | Recall | F1 Score |
| :---: | :--------: | :--------: | :--------: |
| 0-shot | 0.64634| 0.53535 | 0.58564 |
| 5-shot | 0.89474 | 0.85000 | 0.87179 |
| 10-shot | 0.92793 | 0.85833 | 0.89177 |
| full-set | 0.93103 | 0.90000 | 0.91525 |
<a name="27"></a>
### 2.7 Closed Domain Distillation
Some industrial application scenarios have strict inference performance requirements, and a model that cannot be effectively compressed cannot go into production. We built [UIE Slim Data Distillation](./data_distill/README_en.md) on top of knowledge distillation techniques. The principle is to use data as a bridge to transfer the knowledge of the UIE model to a smaller closed-domain information extraction model, achieving a significant inference speedup with minimal loss of accuracy.
# UIE Slim Data Distillation
Behind UIE's strong extraction ability lies a correspondingly large compute cost. Some industrial application scenarios have strict inference performance requirements, and a model that cannot be effectively compressed cannot be put into production. We therefore built the UIE Slim data distillation system. The idea is to use data as a bridge to transfer the knowledge of the UIE model to a small closed-domain information extraction model, achieving a large inference speedup with little loss in accuracy.
#### UIE data distillation in three steps
- **Step 1**: Fine-tune UIE on the labeled data to obtain the Teacher Model.
- **Step 2**: Provide large-scale unlabeled data, which must come from the same source as the labeled data, and predict on it with Taskflow UIE.
- **Step 3**: Train the closed-domain Student Model on the labeled data plus the synthetic data obtained in Step 2.
## UIE Finetune
Follow [UIE relation extraction fine-tuning](../README.md) to fine-tune the model and obtain `../checkpoint/model_best`.
## Offline Distillation
#### Predict labels for the unlabeled data with the trained custom UIE model
```shell
python data_distill.py \
--data_path ../data \
--save_dir student_data \
--task_type relation_extraction \
--synthetic_ratio 10 \
--model_path ../checkpoint/model_best
```
**NOTE**: The schema must be configured in `data_distill.py` according to the labeled data, and it must contain all label types present in the labeled data.
Parameters:
- `data_path`: Path to the labeled data (`doccano_ext.json`) and the unlabeled text (`unlabeled_data.txt`).
- `model_path`: Path to the trained custom UIE model.
- `save_dir`: Path where the student model's training data is saved.
- `synthetic_ratio`: Controls the proportion of synthetic data. The maximum number of synthetic examples = synthetic_ratio × the number of labeled examples; for example, with 1,000 labeled examples and a ratio of 10, at most 10,000 synthetic examples are generated.
- `platform`: Labeling platform used for the labeled data; options are `doccano` and `label_studio`. Defaults to `label_studio`.
- `task_type`: Task type; options are `entity_extraction`, `relation_extraction`, `event_extraction` and `opinion_extraction`. Since this is closed-domain extraction and post-processing differs across tasks, the task type must be specified.
- `seed`: Random seed; defaults to 1000.
#### Teacher model evaluation
The UIE fine-tuning stage evaluates the model on data in the UIE training format (this is not an end-to-end evaluation and is not the standard protocol for relation or event extraction). The following script performs an end-to-end evaluation:
```shell
python evaluate_teacher.py \
--task_type relation_extraction \
--test_path ./student_data/dev_data.json \
--label_maps_path ./student_data/label_maps.json \
--model_path ../checkpoint/model_best
```
Parameters:
- `model_path`: Path to the trained custom UIE model.
- `test_path`: Path to the test dataset.
- `label_maps_path`: Label dictionary of the student model.
- `batch_size`: Batch size; defaults to 8.
- `max_seq_len`: Maximum text length; defaults to 256.
- `task_type`: Task type; options are `entity_extraction`, `relation_extraction`, `event_extraction` and `opinion_extraction`. Since this is a closed-domain evaluation, the task type must be specified.
#### Student model training
```shell
python train.py \
--task_type relation_extraction \
--train_path student_data/train_data.json \
--dev_path student_data/dev_data.json \
--label_maps_path student_data/label_maps.json \
--num_epochs 50 \
--encoder ernie-3.0-mini-zh
```
Parameters:
- `train_path`: Training set file path.
- `dev_path`: Validation set file path.
- `batch_size`: Batch size; defaults to 16.
- `learning_rate`: Learning rate; defaults to 3e-5.
- `save_dir`: Model save directory; defaults to `./checkpoint`.
- `max_seq_len`: Maximum text length; defaults to 256.
- `weight_decay`: Weight decay coefficient used by the AdamW optimizer.
- `warmup_proportion`: Proportion of training steps used for learning-rate warmup. If 0.1, the learning rate grows from 0 to learning_rate over the first 10% of training steps and then slowly decays; defaults to 0.0.
- `num_epochs`: Number of training epochs; defaults to 100.
- `seed`: Random seed; defaults to 1000.
- `encoder`: Backbone of the student model; defaults to `ernie-3.0-mini-zh`.
- `task_type`: Task type; options are `entity_extraction`, `relation_extraction`, `event_extraction` and `opinion_extraction`. Since this is closed-domain extraction, the task type must be specified.
- `logging_steps`: Interval in steps between log prints; defaults to 10.
- `eval_steps`: Interval in steps between evaluations; defaults to 200.
- `device`: Device used for training; 'cpu' or 'gpu'.
- `init_from_ckpt`: Optional; path of model parameters for warm-starting training. Defaults to None.
#### Student model evaluation
```shell
python evaluate.py \
--model_path ./checkpoint/model_best \
--test_path student_data/dev_data.json \
--task_type relation_extraction \
--label_maps_path student_data/label_maps.json \
--encoder ernie-3.0-mini-zh
```
Parameters:
- `model_path`: Path of the trained model folder to evaluate.
- `test_path`: Path to the test dataset.
- `label_maps_path`: Label dictionary of the student model.
- `batch_size`: Batch size; defaults to 8.
- `max_seq_len`: Maximum text length; defaults to 256.
- `encoder`: Backbone of the student model; defaults to `ernie-3.0-mini-zh`.
- `task_type`: Task type; options are `entity_extraction`, `relation_extraction`, `event_extraction` and `opinion_extraction`. Since this is a closed-domain evaluation, the task type must be specified.
## Deploying the Student Model with Taskflow
- Deploy the closed-domain information extraction model in one step with Taskflow; `task_path` is the student model path.
```python
>>> from pprint import pprint
>>> from paddlenlp import Taskflow
>>> my_ie = Taskflow("information_extraction", model="uie-data-distill-gp", task_path="checkpoint/model_best/") # Schema is fixed in closed-domain information extraction
>>> pprint(my_ie("威尔哥(Virgo)减速炸弹是由瑞典FFV军械公司专门为瑞典皇家空军的攻击机实施低空高速轰炸而研制,1956年开始研制,1963年进入服役,装备于A32“矛盾”、A35“龙”、和AJ134“雷”攻击机,主要用于攻击登陆艇、停放的飞机、高炮、野战火炮、轻型防护装甲车辆以及有生力量。"))
[{'武器名称': [{'end': 14,
'probability': 0.9976037,
'relations': {'产国': [{'end': 18,
'probability': 0.9988706,
'relations': {},
'start': 16,
'text': '瑞典'}],
'研发单位': [{'end': 25,
'probability': 0.9978277,
'relations': {},
'start': 18,
'text': 'FFV军械公司'}],
'类型': [{'end': 14,
'probability': 0.99837446,
'relations': {},
'start': 12,
'text': '炸弹'}]},
'start': 0,
'text': '威尔哥(Virgo)减速炸弹'}]}]
```
# References
- **[GlobalPointer](https://kexue.fm/search/globalpointer/)**
- **[GPLinker](https://kexue.fm/archives/8888)**
- **[JunnYu/GPLinker_pytorch](https://github.com/JunnYu/GPLinker_pytorch)**
- **[CBLUE](https://github.com/CBLUEbenchmark/CBLUE)**