This Information Extraction (IE) guide introduces our open-source industry-grade solution that covers the most widely-used application scenarios of Information Extraction. It features **multi-domain, multi-task, and cross-modal capabilities** and goes through the full lifecycle of **data labeling, model training and model deployment**. We hope this guide can help you apply Information Extraction techniques in your own products or models.
Information Extraction (IE) is the process of extracting structured information from given input data such as text, pictures or scanned document. While IE brings immense value, applying IE techniques is never easy with challenges such as domain adaptation, heterogeneous structures, lack of labeled data, etc. This PaddleNLP Information Extraction Guide builds on the foundation of our work in [Universal Information Extraction](https://arxiv.org/abs/2203.12277) and provides an industrial-level solution that not only supports **extracting entities, relations, events and opinions from plain text**, but also supports **cross-modal extraction out of documents, tables and pictures.** Our method features a flexible prompt, which allows you to specify extraction targets with simple natural language. We also provide a few different domain-adapted models specialized for different industry sectors.
**Highlights:**
- **Comprehensive Coverage🎓:** Covers the mainstream information extraction tasks for both plain-text and document scenarios, and supports multiple languages
- **State-of-the-Art Performance🏃:** The UIE model series delivers strong results on plain-text and multimodal datasets. We also provide pretrained models of various sizes to meet different needs
- **Easy to Use⚡:** Three lines of code to use our `Taskflow` for out-of-the-box information extraction; one command line each for model training and model deployment
- **Efficient Tuning✊:** Developers can easily get started with data labeling and model training without a background in Machine Learning.
<a name="2"></a>
## 2. Quick Start
For a quick start, you can directly use ```paddlenlp.Taskflow``` out of the box, leveraging its zero-shot capability. For production use cases, we recommend labeling a small amount of data for model fine-tuning to further improve performance.
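Out of the box, extraction takes three lines of code with `Taskflow`; a minimal sketch (the schema and sentence follow the canonical UIE example — adjust the schema to your own extraction targets):

```python
from pprint import pprint
from paddlenlp import Taskflow

# Extraction targets are specified in natural language.
schema = ["时间", "选手", "赛事名称"]
ie = Taskflow("information_extraction", schema=schema)
pprint(ie("2月8日上午北京冬奥会自由式滑雪女子大跳台决赛中中国选手谷爱凌以188.25分获得金牌!"))
```

The first call downloads the default model; no training is required for zero-shot use.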
<a name="21"></a>
### 2.1 Code Structure
```shell
.
├── utils.py # data processing tools
├── finetune.py # model fine-tuning, compression script
├── evaluate.py # model evaluation script
└── README.md
```
<a name="22"></a>
### 2.2 Data Annotation
We recommend using [Label Studio](https://labelstud.io/) for data labeling, and we provide an end-to-end pipeline from labeling to training. After exporting the labeled data from Label Studio, you can use the [label_studio.py](../label_studio.py) script to convert it into the input format required by the model. For a detailed introduction to labeling methods, please refer to the [Label Studio Data Labeling Guide](../label_studio_doc_en.md).
Here we provide the pre-labeled example dataset `VAT invoice dataset`, which you can download by running the following command. We will demonstrate how to use the data conversion script to generate training/validation/test set files for finetuning.
Generate the training/validation set files; you can enable PP-Structure's layout analysis to optimize the ordering of the OCR results:
```shell
python ../label_studio.py \
--label_studio_file ./data/label_studio.json \
--save_dir ./data \
    --splits 0.8 0.2 0 \
    --task_type ext \
    --layout_analysis True
```
For more labeling rules and parameter descriptions for different types of tasks (including entity extraction, relationship extraction, document classification, etc.), please refer to [Label Studio Data Labeling Guide](../label_studio_doc_en.md).
<a name="23"></a>
### 2.3 Finetuning
Use the following command to fine-tune the model, using `uie-x-base` as the pre-trained model, and save the fine-tuned model to `./checkpoint/model_best`.
Since `--do_eval` is set in the sample command, evaluation runs automatically after training.
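A sketch of such a fine-tuning command (the paths and hyperparameter values here are illustrative; the flags mirror the parameter descriptions below):

```shell
python finetune.py \
    --device gpu \
    --logging_steps 10 \
    --save_steps 100 \
    --eval_steps 100 \
    --seed 42 \
    --model_name_or_path uie-x-base \
    --output_dir ./checkpoint/model_best \
    --train_path data/train.txt \
    --dev_path data/dev.txt \
    --max_seq_len 512 \
    --per_device_train_batch_size 8 \
    --per_device_eval_batch_size 8 \
    --num_train_epochs 10 \
    --learning_rate 1e-5 \
    --label_names "start_positions" "end_positions" \
    --do_train \
    --do_eval \
    --do_export \
    --export_model_dir ./checkpoint/model_best \
    --overwrite_output_dir \
    --disable_tqdm True \
    --metric_for_best_model eval_f1 \
    --load_best_model_at_end True \
    --save_total_limit 1
```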
Parameters:
* `device`: Training device; one of 'cpu', 'gpu' and 'npu'. Defaults to GPU training.
* `logging_steps`: Interval (in steps) for printing logs during training; defaults to 10.
* `save_steps`: Interval (in steps) for saving model checkpoints during training; defaults to 100.
* `eval_steps`: Interval (in steps) for running evaluation during training; defaults to 100.
* `seed`: Global random seed; defaults to 42.
* `model_name_or_path`: The pre-trained model used for few-shot training. Defaults to "uie-x-base".
* `output_dir`: Required. Directory where the model is saved after training or compression; defaults to `None`.
* `train_path`: Training set path; defaults to `None`.
* `dev_path`: Development set path; defaults to `None`.
* `max_seq_len`: Maximum segmentation length of the text. Inputs longer than this are automatically segmented; defaults to 512.
* `per_device_train_batch_size`: Training batch size per GPU/NPU core or CPU; defaults to 8.
* `per_device_eval_batch_size`: Evaluation batch size per GPU/NPU core or CPU; defaults to 8.
* `num_train_epochs`: Number of training epochs; 100 is a reasonable choice when using early stopping. Defaults to 10.
* `learning_rate`: Maximum learning rate for training. We recommend 1e-5 for UIE-X; defaults to 3e-5.
* `label_names`: Names of the training data labels. For UIE-X, set to 'start_positions' 'end_positions'; defaults to `None`.
* `do_train`: Whether to perform fine-tuning; set this flag to train. Not set by default.
* `do_eval`: Whether to evaluate; set this flag to evaluate. Not set by default.
* `do_export`: Whether to export a static graph model; set this flag to export. Not set by default.
* `export_model_dir`: Directory for the exported static graph model; defaults to `None`.
* `overwrite_output_dir`: If `True`, overwrite the contents of the output directory. If `output_dir` points to a checkpoint directory, use this to continue training.
* `disable_tqdm`: Whether to disable the tqdm progress bar.
* `metric_for_best_model`: Metric for selecting the best model. We recommend `eval_f1` for UIE-X; defaults to `None`.
* `load_best_model_at_end`: Whether to load the best model at the end of training; usually used together with `metric_for_best_model`. Defaults to `False`.
* `save_total_limit`: If set, limits the total number of checkpoints; older checkpoints in `output_dir` are removed. Defaults to `None`.
<a name="24"></a>
### 2.4 Evaluation
```shell
python evaluate.py \
    --device "gpu" \
--model_path ./checkpoint/model_best \
--test_path ./data/dev.txt \
--output_dir ./checkpoint/model_best \
    --label_names "start_positions" "end_positions" \
--max_seq_len 512 \
--per_device_eval_batch_size 16
```
We adopt the single-stage method for evaluation, which means tasks that require multiple stages (e.g. relation extraction, event extraction) are evaluated separately for each stage. By default, the validation/test set uses all labels at the same level to construct the negative examples.
The `debug` mode can be turned on to evaluate each positive category separately. This mode is only used for model debugging:
* `device`: Evaluation device; one of 'cpu', 'gpu' and 'npu'. Defaults to GPU evaluation.
* `model_path`: Path of the model folder for evaluation, which must contain the model weight file `model_state.pdparams` and the configuration file `model_config.json`.
* `test_path`: Test set file for evaluation.
* `label_names`: Names of the training data labels. For UIE-X, set to 'start_positions' 'end_positions'; defaults to `None`.
* `batch_size`: Batch size; adjust according to your machine. Defaults to 16.
* `max_seq_len`: Maximum segmentation length of the text. Inputs longer than this are automatically segmented; defaults to 512.
* `per_device_eval_batch_size`: Evaluation batch size per GPU/NPU core or CPU; defaults to 8.
* `debug`: Whether to enable debug mode to evaluate each positive category separately. This mode is only for model debugging and is disabled by default.
* `schema_lang`: Language of the schema; `ch` or `en`. Defaults to `ch`; use `en` for English datasets.
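The single-stage evaluation above scores each stage by span-level exact match; a minimal sketch of how precision/recall/F1 fall out of predicted and gold spans (an illustrative helper, not the script's actual code):

```python
def span_f1(pred, gold):
    """Span-level exact-match precision, recall and F1.

    `pred` and `gold` are collections of extracted spans; a span counts
    as correct only if it appears verbatim in the gold set.
    """
    pred, gold = set(pred), set(gold)
    tp = len(pred & gold)  # true positives: exact-match spans
    precision = tp / len(pred) if pred else 0.0
    recall = tp / len(gold) if gold else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1
```

For multi-stage tasks such as relation extraction, each stage (subject extraction, then relation extraction) is scored separately with the same metric.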
<a name="25"></a>
### 2.5 Inference
As with the pretrained models, you can use `paddlenlp.Taskflow` to load your custom model by specifying the path to the model weights via `task_path`.
n-shot means the training set contains n labeled images for model fine-tuning. Experiments show that UIE-X can further improve results with a small amount of data (few-shot) combined with PP-Structure layout analysis.
# Service deployment based on PaddleNLP SimpleServing
## Table of contents
- [Environment Preparation](#1)
- [Server](#2)
- [Client](#3)
- [Service Custom Parameters](#4)
<a name="1"></a>
## Environment Preparation
Use a PaddleNLP version that includes the SimpleServing feature (or the latest develop version):
```shell
pip install "paddlenlp>=2.4.4"
```
<a name="2"></a>
## Server
```bash
paddlenlp server server:app --workers 1 --host 0.0.0.0 --port 8189
```
<a name="3"></a>
## Client
```bash
python client.py
```
<a name="4"></a>
## Service custom parameters
### Server Custom Parameters
#### Schema replacement
```python
# Default schema
schema = ['Billing Date', 'Name', 'Taxpayer Identification Number', 'Account Bank and Account Number', 'Amount', 'Total Price and Tax', 'No', 'Tax Rate', 'Address, Phone', 'tax']
```
#### Multi-card service
PaddleNLP SimpleServing supports multi-card load-balanced prediction: when registering the service, simply register two Taskflow instances.
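A sketch of that registration, assuming PaddleNLP SimpleServing's `SimpleServer.register_taskflow` API and two visible GPUs (the `device_id` values and the schema are illustrative); save it as `server.py` so it matches the startup command above:

```python
from paddlenlp import SimpleServer, Taskflow

# One Taskflow instance per card; requests are load-balanced across them.
schema = ['Billing Date', 'Name', 'Amount', 'Tax Rate']
uie1 = Taskflow('information_extraction', schema=schema, device_id=0)
uie2 = Taskflow('information_extraction', schema=schema, device_id=1)

app = SimpleServer()
# Registering a list of Taskflow instances enables multi-card load balancing.
app.register_taskflow('taskflow/uie', [uie1, uie2])
```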
```python
logger.info("Finished! It takes %.2f seconds" % (time.time() - tic_time))

if __name__ == "__main__":
    # yapf: disable
    parser = argparse.ArgumentParser()
    parser.add_argument("--label_studio_file", default="./data/label_studio.json", type=str, help="The annotation file exported from the Label Studio platform.")
    parser.add_argument("--save_dir", default="./data", type=str, help="The path where the converted data is saved.")
    parser.add_argument("--negative_ratio", default=5, type=int, help="Used only for the extraction task; the ratio of negative to positive samples: number of negative samples = negative_ratio * number of positive samples.")
    parser.add_argument("--splits", default=[0.8, 0.1, 0.1], type=float, nargs="*", help="The ratio of samples in the datasets. [0.6, 0.2, 0.2] means 60% of samples are used for training, 20% for evaluation and 20% for testing.")
    parser.add_argument("--task_type", choices=["ext", "cls"], default="ext", type=str, help="Select the task type: ext for the extraction task and cls for the classification task. Defaults to ext.")
    parser.add_argument("--options", default=["正向", "负向"], type=str, nargs="+", help="Used only for the classification task; the options for classification.")
    parser.add_argument("--prompt_prefix", default="情感倾向", type=str, help="Used only for the classification task; the prompt prefix for classification.")
    parser.add_argument("--is_shuffle", default="True", type=strtobool, help="Whether to shuffle the labeled dataset. Defaults to True.")
    parser.add_argument("--layout_analysis", default=False, type=bool, help="Enable layout analysis to optimize the order of OCR results.")
    parser.add_argument("--seed", type=int, default=1000, help="Random seed for initialization.")
    parser.add_argument("--separator", type=str, default="##", help="Used only for entity/aspect-level classification tasks; separator between the entity label and the classification label.")
    parser.add_argument("--schema_lang", choices=["ch", "en"], default="ch", help="Select the language type for the schema.")
    parser.add_argument("--ocr_lang", choices=["ch", "en"], default="ch", help="Select the language type for OCR.")
```
# Label Studio User Guide - Document Information Extraction
**Table of contents**
- [1. Installation](#1)
- [2. Document Extraction Task Annotation](#2)
  - [2.1 Project Creation](#21)
  - [2.2 Data Upload](#22)
  - [2.3 Label Construction](#23)
  - [2.4 Task Annotation](#24)
  - [2.5 Data Export](#25)
  - [2.6 Data Conversion](#26)
  - [2.7 More Configuration](#27)
<a name="1"></a>
## 1. Installation
**Environmental configuration used in the following annotation examples:**
- Python 3.8+
- label-studio == 1.6.0
- paddleocr >= 2.6.0.1
Use pip to install label-studio in the terminal:
```shell
pip install label-studio==1.6.0
```
Once the installation is complete, run the following command line:
```shell
label-studio start
```
Open [http://localhost:8080/](http://localhost:8080/) in the browser, enter the user name and password to log in, and start using label-studio for labeling.
<a name="2"></a>
## 2. Document Extraction Task Annotation
<a name="21"></a>
#### 2.1 Project Creation
Click Create to start creating a new project, fill in the project name, description, and select ``Object Detection with Bounding Boxes``.
After renaming the exported file to ``label_studio.json``, put it into the ``./document/data`` directory, and put the corresponding labeled images into the ``./document/data/images`` directory (each image file name must match the one uploaded to Label Studio). The [label_studio.py](./label_studio.py) script can then convert the data into the UIE data format.
- Path example
```shell
./document/data/
├── images # image directory
│   ├── b0.jpg # original image (file name must match the one uploaded to Label Studio)
│   └── b1.jpg
└── label_studio.json # annotation file exported from Label Studio
```
- ``label_studio_file``: Data labeling file exported from label studio.
- ``save_dir``: The storage directory of the training data, which is stored in the ``data`` directory by default.
- ``negative_ratio``: The maximum negative-example ratio. This parameter is only valid for extraction tasks; properly constructed negative examples can improve model performance. The number of negative examples is related to the actual number of labels: maximum negatives = negative_ratio * number of positive examples. This parameter only affects the training set and defaults to 5. To keep evaluation metrics accurate, the validation and test sets are built with all negative examples by default.
- ``splits``: The proportions used when dividing the dataset. Defaults to [0.8, 0.1, 0.1], which splits the data into training, validation and test sets at a ratio of ``8:1:1``.
- ``task_type``: Select the task type, there are two types of tasks: extraction and classification.
- ``options``: Specify the category label of the classification task, this parameter is only valid for the classification type task. Defaults to ["positive", "negative"].
- ``prompt_prefix``: Declare the prompt prefix information of the classification task, this parameter is only valid for the classification type task. Defaults to "Sentimental Tendency".
- ``is_shuffle``: Whether to randomly shuffle the data set, the default is True.
- ``seed``: random seed, default is 1000.
- ``separator``: The separator between the entity category/evaluation dimension and the classification label. This parameter is only valid for entity/evaluation-dimension classification tasks. The default is "##".
- ``schema_lang``: Select the language of the schema, which will be the construction method of the training data prompt, optional `ch` and `en`. Defaults to `ch`.
- ``ocr_lang``: Select the language for OCR, optional `ch` and `en`. Defaults to `ch`.
- ``layout_analysis``: Whether to use PPStructure to analyze the layout of the document. This parameter is only valid for document type labeling tasks. The default is False.
Note:
- By default the [label_studio.py](./label_studio.py) script will divide the data proportionally into train/dev/test datasets
- Each time the [label_studio.py](./label_studio.py) script is executed, the existing data file with the same name will be overwritten
- In the model training phase, we recommend constructing some negative examples to improve the model performance, and we have built-in this function in the data conversion phase. The proportion of automatically constructed negative samples can be controlled by `negative_ratio`; the number of negative samples = negative_ratio * the number of positive samples.
- For files exported from Label Studio, each piece of data is assumed to be correctly labeled by hand.
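As a back-of-the-envelope check, the split and negative-sampling arithmetic described above can be sketched as follows (`plan_conversion` is a hypothetical helper, not part of [label_studio.py](./label_studio.py)):

```python
def plan_conversion(num_labeled, splits=(0.8, 0.1, 0.1), negative_ratio=5):
    """Estimate output set sizes for the data conversion step.

    Negatives are only added to the training split; the validation and
    test splits keep all same-level labels as negatives so that
    evaluation metrics stay comparable.
    """
    n_train = int(num_labeled * splits[0])
    n_dev = int(num_labeled * splits[1])
    n_test = num_labeled - n_train - n_dev
    # Upper bound: the actual count also depends on the labels present.
    max_negatives = negative_ratio * n_train
    return {"train": n_train, "dev": n_dev, "test": n_test,
            "max_train_negatives": max_negatives}
```

For example, 100 labeled documents with the defaults yield an 80/10/10 split and at most 400 auto-constructed training negatives.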
# Label Studio User Guide - Text Information Extraction
**Table of contents**
- [1. Installation](#1)
- [2. Text Extraction Task Annotation](#2)
  - [2.1 Project Creation](#21)
  - [2.2 Data Upload](#22)
  - [2.3 Label Construction](#23)
  - [2.4 Task Annotation](#24)
  - [2.5 Data Export](#25)
  - [2.6 Data Conversion](#26)
  - [2.7 More Configuration](#27)
<a name="1"></a>
## 1. Installation
**Environmental configuration used in the following annotation examples:**
- Python 3.8+
- label-studio == 1.6.0
- paddleocr >= 2.6.0.1
Use pip to install label-studio in the terminal:
```shell
pip install label-studio==1.6.0
```
Once the installation is complete, run the following command line:
```shell
label-studio start
```
Open [http://localhost:8080/](http://localhost:8080/) in the browser, enter the user name and password to log in, and start using label-studio for labeling.
<a name="2"></a>
## 2. Text Extraction Task Annotation
<a name="21"></a>
#### 2.1 Project Creation
Click Create to start creating a new project, fill in the project name and description, and select ``Relation Extraction``.
For relation extraction, the choice of the relation type P is very important, and the following principle needs to be followed:
"{P} of {S} is {O}" must form a semantically reasonable phrase. For example, for a triple (S, father-son, O), there is nothing wrong with "father and son" as a relation category per se. However, given the current structure of the UIE relation-type prompt, "the father and son of S is O" does not read smoothly, so it is better to set P to "child", i.e. "child of S is O". **A well-chosen P type will significantly improve zero-shot performance**.
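The principle above can be made concrete with a tiny helper (illustrative only, showing the English "{P} of {S}" prompt pattern):

```python
def relation_prompt(subject_text, predicate):
    """Compose the relation prompt for an extracted subject.

    A good P reads naturally here: relation_prompt("S", "child")
    gives "child of S", while "father and son of S" would not.
    """
    return f"{predicate} of {subject_text}"
```

A quick sanity check when choosing P: read the composed prompt aloud and keep P only if the phrase sounds natural.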
The schema corresponding to this annotation example is:
Rename the exported file to ``label_studio.json`` and put it in the ``./data`` directory. Through the [label_studio.py](./label_studio.py) script, it can be converted to the data format of UIE.
- Extraction task
```shell
python label_studio.py \
--label_studio_file ./data/label_studio.json \
--save_dir ./data \
--splits 0.8 0.1 0.1 \
--task_type ext
```
- Sentence-level classification tasks
In the data conversion stage, we automatically construct prompt information for model training. For example, in sentence-level sentiment classification the prompt is ``Sentiment Classification [positive, negative]``, which can be configured through the `prompt_prefix` and `options` parameters.
```shell
python label_studio.py \
--label_studio_file ./data/label_studio.json \
--task_type cls \
--save_dir ./data \
--splits 0.8 0.1 0.1 \
--prompt_prefix "Sentiment Classification" \
--options "positive" "negative"
```
- Opinion Extraction
In the data conversion stage, we automatically construct prompt information for model training. For example, in aspect-level sentiment classification the prompt is ``Sentiment Classification of xxx [positive, negative]``, which can be declared through the `prompt_prefix` and `options` parameters.
```shell
python label_studio.py \
--label_studio_file ./data/label_studio.json \
--task_type ext \
--save_dir ./data \
--splits 0.8 0.1 0.1 \
--prompt_prefix "Sentiment Classification" \
--options "positive" "negative" \
--separator "##"
```
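The prompt construction for both classification variants can be sketched as follows (an illustrative rendering of the English prompt format described above, not the converter's actual code):

```python
def build_cls_prompt(prompt_prefix, options, aspect=None):
    """Build the classification prompt used during data conversion.

    Sentence-level: "Sentiment Classification [positive, negative]"
    Aspect-level:   "Sentiment Classification of <aspect> [positive, negative]"
    """
    target = f"{prompt_prefix} of {aspect}" if aspect else prompt_prefix
    return f"{target} [{', '.join(options)}]"
```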
<a name="27"></a>
#### 2.7 More Configuration
- ``label_studio_file``: Data labeling file exported from label studio.
- ``save_dir``: The storage directory of the training data, which is stored in the ``data`` directory by default.
- ``negative_ratio``: The maximum negative-example ratio. This parameter is only valid for extraction tasks; properly constructed negative examples can improve model performance. The number of negative examples is related to the actual number of labels: maximum negatives = negative_ratio * number of positive examples. This parameter only affects the training set and defaults to 5. To keep evaluation metrics accurate, the validation and test sets are built with all negative examples by default.
- ``splits``: The proportions used when dividing the dataset. Defaults to [0.8, 0.1, 0.1], which splits the data into training, validation and test sets at a ratio of ``8:1:1``.
- ``task_type``: Select the task type, there are two types of tasks: extraction and classification.
- ``options``: Specify the category label of the classification task, this parameter is only valid for the classification type task. Defaults to ["positive", "negative"].
- ``prompt_prefix``: Declare the prompt prefix information of the classification task, this parameter is only valid for the classification type task. Defaults to "Sentimental Tendency".
- ``is_shuffle``: Whether to randomly shuffle the data set, the default is True.
- ``seed``: random seed, default is 1000.
- ``schema_lang``: Select the language of the schema, which will be the construction method of the training data prompt, optional `ch` and `en`. Defaults to `ch`.
- ``separator``: The separator between the entity category/evaluation dimension and the classification label. This parameter is only valid for entity/evaluation-dimension classification tasks. The default is "##".
Note:
- By default the [label_studio.py](./label_studio.py) script will divide the data proportionally into train/dev/test datasets
- Each time the [label_studio.py](./label_studio.py) script is executed, the existing data file with the same name will be overwritten
- In the model training phase, we recommend constructing some negative examples to improve the model performance, and we have built-in this function in the data conversion phase. The proportion of automatically constructed negative samples can be controlled by `negative_ratio`; the number of negative samples = negative_ratio * the number of positive samples.
- For files exported from Label Studio, each piece of data is assumed to be correctly labeled by hand.
```paddlenlp.Taskflow``` provides general-purpose information extraction for text and documents, opinion extraction, and other capabilities. It can extract many types of information, including but not limited to named entities (e.g. person, place and organization names), relations (e.g. a movie's director, a song's release time), events (e.g. a car accident at an intersection, an earthquake in some place), as well as product reviews, opinions and sentiment. Users can specify extraction targets in natural language and uniformly extract the corresponding information from input text or documents without training.
<a name="2"></a>
## 2. Document Information Extraction
This section introduces the document extraction capability of Taskflow with the following example picture [download link](https://bj.bcebos.com/paddlenlp/taskflow/information_extraction/cases.zip).
<a name="21"></a>
#### 2.1 Entity Extraction
Entity extraction, also known as Named Entity Recognition (NER), refers to identifying entities with specific meanings in text. UIE adopts an open-domain approach: entity categories are not fixed, and users can define them through natural language.
Relation Extraction refers to identifying entities from text and extracting the semantic relationship between entities, and then obtaining triple information, namely <subject, predicate, object>.
For document information extraction, UIE-X supports image paths, HTTP image links and base64 input, covering both image and PDF document formats. In the input dict, `text` indicates text input and `doc` refers to document input.
**NOTE**: For multi-page PDF input, currently only the first page is extracted. UIE-X is best suited to information extraction from documents such as bills and receipts, and is not suitable for very long or multi-page documents.
- Using custom OCR input
```python
layout = [
([68.0, 12.0, 167.0, 70.0], '名次'),
([464.0, 13.0, 559.0, 67.0], '球员'),
([833.0, 15.0, 1054.0, 64.0], '总出场时间'),
    # ...
]
ie({"doc": doc_path, 'layout': layout})
```
<a name="25"></a>
#### 2.5 Tips
- Using the PP-Structure layout analysis function
OCR results are sorted from top-left to bottom-right. For cases such as multi-column layouts and multi-line text in tables, we recommend enabling layout analysis (``layout_analysis=True``) to optimize text ordering and improve extraction. The following only illustrates usage scenarios of the layout analysis function; real scenarios generally require labeling and fine-tuning.
* `schema`: Defines the extraction targets; see the out-of-the-box examples for each task for how to configure it.
* `schema_lang`: Language of the schema; `ch` or `en`, defaults to `ch`. Because Chinese and English schemas are structured differently, the schema language must be specified.
* `ocr_lang`: Language for PaddleOCR; `ch` can be used for mixed Chinese-English images, `en` works better on English-only images. Defaults to `ch`.
* `batch_size`: Batch size; adjust according to your machine. Defaults to 16.
* `model`: Model used by the task; defaults to `uie-base`. Options are `uie-base`, `uie-medium`, `uie-mini`, `uie-micro`, `uie-nano`, `uie-medical-base`, `uie-base-en` and `uie-x-base`.
* `layout_analysis`: Whether to use PP-Structure to analyze the document layout and optimize the ordering of layout information. Defaults to False.
* `position_prob`: Probability threshold (between 0 and 1) on the model's start-position/end-position probabilities for a span; results below this threshold are removed. Defaults to 0.5. The final probability reported for a span is the product of its start-position and end-position probabilities.
* `precision`: Model precision; `fp32` (default) or `fp16`. `fp16` inference is faster and supports GPU and NPU hardware. If you choose `fp16` on GPU, make sure NVIDIA drivers and libraries are correctly installed, with **CUDA>=11.2 and cuDNN>=8.1.1**; on first use, follow the prompts to install the required dependencies. Also make sure the GPU's CUDA Compute Capability is greater than 7.0; typical devices include V100, T4, A10, A100, and GTX 20/30-series cards. For more on CUDA Compute Capability and precision support, see the NVIDIA documentation: [GPU Hardware and Supported Precision Matrix](https://docs.nvidia.com/deeplearning/tensorrt/archives/tensorrt-840-ea/support-matrix/index.html#hardware-precision-matrix).
* `use_fast`: Use FastTokenizer, a high-performance tokenizer implemented in C++, to accelerate text preprocessing. Install it first via `pip install fast-tokenizer-python`. Defaults to `False`. For more usage instructions, see the [FastTokenizer documentation](../../fast_tokenizer).
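The `position_prob` thresholding and the reported span probability can be sketched as follows (the candidate tuples below are an assumed data shape for illustration, not Taskflow's internal representation):

```python
def filter_spans(candidates, position_prob=0.5):
    """Keep spans whose start and end probabilities clear the threshold.

    Each candidate is (start_prob, end_prob, text); the probability
    reported for a kept span is the product of the two.
    """
    kept = []
    for start_p, end_p, text in candidates:
        if start_p >= position_prob and end_p >= position_prob:
            kept.append({"text": text, "probability": start_p * end_p})
    return kept
```

This is why reported probabilities can be well below the threshold-crossing start/end values: 0.9 and 0.8 yield a span probability of 0.72.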
# UIE Taskflow User Guide - Text Information Extraction
**Table of contents**
- [1. Introduction](#1)
- [2. Examples](#2)
- [3. Text Information Extraction](#3)
  - [3.1 Entity Extraction](#31)
  - [3.2 Relation Extraction](#32)
  - [3.3 Event Extraction](#33)
  - [3.4 Opinion Extraction](#34)
  - [3.5 Sentiment Classification](#35)
  - [3.6 Multi-task Extraction](#36)
  - [3.7 Available Models](#37)
  - [3.8 More Configuration](#38)
<a name="1"></a>
## 1. Introduction
```paddlenlp.Taskflow``` provides general-purpose information extraction for text and documents, opinion extraction, and other capabilities. It can extract many types of information, including but not limited to named entities (e.g. person, place and organization names), relations (e.g. a movie's director, a song's release time), events (e.g. a car accident at an intersection, an earthquake in some place), as well as product reviews, opinions and sentiment. Users can specify extraction targets in natural language and uniformly extract the corresponding information from input text or documents without training.
<a name="2"></a>
## 2. Examples
UIE does not limit industry fields and extraction targets. The following are some industry examples implemented out of the box by Taskflow:
- Medical scenarios - specialized disease structure
Entity extraction, also known as Named Entity Recognition (NER), refers to identifying entities with specific meanings in text. In open-domain information extraction, the extracted categories are not limited, and users can define them themselves.
- For example, if the target entity types are "person" and "organization", the schema is defined as follows:
```python
>>> pprint(ie_en('In 1997, Steve was excited to become the CEO of Apple.'))
[{'Organization': [{'end': 53,
'probability': 0.9985840259877357,
'start': 48,
'text': 'Apple'}],
'Person': [{'end': 14,
'probability': 0.999631971804547,
'start': 9,
'text': 'Steve'}]}]
```
<a name="32"></a>
#### 3.2 Relation Extraction
Relation Extraction refers to identifying entities from text and extracting the semantic relationship between entities, and then obtaining triple information, namely <subject, predicate, object>.
- For example, if "person" is used as the extraction subject, and the extraction relationship types are "Company" and "Position", the schema structure is as follows:
```python
>>> pprint(ie_en('In 1997, Steve was excited to become the CEO of Apple.'))
[{'Person': [{'end': 14,
'probability': 0.999631971804547,
'relations': {'Company': [{'end': 53,
'probability': 0.9960158209451642,
'start': 48,
'text': 'Apple'}],
'Position': [{'end': 44,
'probability': 0.8871063806420736,
'start': 41,
'text': 'CEO'}]},
'start': 9,
'text': 'Steve'}]}]
```
<a name="33"></a>
#### 3.3 Event Extraction
Event Extraction refers to extracting predefined event trigger words (Trigger) and event arguments (Argument) from natural language texts, and combining them into corresponding event structured information.
- The English models **do not support event extraction**; if needed, you can train a custom model on an English event dataset.
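For event extraction, the schema maps an event trigger to its argument roles; a sketch along the lines of the earthquake example used in the UIE documentation (the keys are illustrative — adapt them to your own event types):

```python
# Trigger word -> argument roles to extract for that event type.
schema = {"地震触发词": ["地震强度", "时间", "震中位置", "震源深度"]}
```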
<a name="34"></a>
#### 3.4 Opinion Extraction
Opinion extraction refers to the extraction of evaluation dimensions and opinion words contained in the text.
- For example, the extraction targets are the evaluation dimensions contained in the text, together with their corresponding opinion words and sentiment polarity. The schema structure is as follows:
- Sentence-level sentiment classification, i.e. judging whether the sentiment of a sentence is "positive" or "negative". The schema structure is as follows:
- For example, in the legal scene, entity extraction and relation extraction are performed on the text at the same time, and the schema can be constructed as follows:
```python
>>> pprint(ie(["2月8日上午北京冬奥会自由式滑雪女子大跳台决赛中中国选手谷爱凌以188.25分获得金牌!", "Rafael Nadal wins French Open Final!"]))
[{'Competition': [{'end': 23,
'probability': 0.9373889907291257,
'start': 6,
'text': '北京冬奥会自由式滑雪女子大跳台决赛'}],
'Player': [{'end': 31,
'probability': 0.6981119555336441,
'start': 28,
'text': '谷爱凌'}],
'Score': [{'end': 39,
'probability': 0.9888507878270296,
'start': 32,
'text': '188.25分'}],
'Time': [{'end': 6,
'probability': 0.9784080036931151,
'start': 0,
'text': '2月8日上午'}]},
{'Competition': [{'end': 35,
'probability': 0.9851549932171295,
'start': 18,
'text': 'French Open Final'}],
'Player': [{'end': 12,
'probability': 0.9379371275888104,
'start': 0,
'text': 'Rafael Nadal'}]}]
```
<a name="38"></a>
#### 3.8 More Configuration
```python
>>> from paddlenlp import Taskflow
>>> ie = Taskflow('information_extraction',
schema="",
schema_lang="ch",
batch_size=16,
model='uie-base',
position_prob=0.5,
precision='fp32',
use_fast=False)
```
* `schema`: Defines the extraction targets. Refer to the out-of-the-box examples of the different tasks above for how to configure it.
* `schema_lang`: The language of the schema. The default is `ch`; options are `ch` and `en`. Chinese and English schemas are structured differently, so the schema language must be specified. This parameter is only valid for the `uie-x-base`, `uie-m-base` and `uie-m-large` models.
* `batch_size`: Batch size; adjust according to your machine. The default is 16.
* `model`: The model used by the task. The default is `uie-base`; options are `uie-base`, `uie-medium`, `uie-mini`, `uie-micro`, `uie-nano`, `uie-medical-base`, `uie-base-en` and `uie-x-base`.
* `position_prob`: The probability threshold for span start/end positions, between 0 and 1; results below this threshold are removed. The default is 0.5. The final probability of a span is the product of its start-position and end-position probabilities.
* `precision`: The model precision. The default is `fp32`; options are `fp16` and `fp32`. `fp16` inference is faster and is supported on GPU and NPU hardware. If you choose `fp16` on GPU hardware, make sure the NVIDIA drivers and base software are correctly installed, with **CUDA >= 11.2 and cuDNN >= 8.1.1**; on first use, follow the prompts to install the relevant dependencies. Also make sure the CUDA Compute Capability of the GPU device is greater than 7.0. Typical devices include the V100, T4, A10, A100, and GTX 20- and 30-series graphics cards. For more information about CUDA Compute Capability and precision support, refer to the NVIDIA documentation: [GPU Hardware and Supported Precision Comparison Table](https://docs.nvidia.com/deeplearning/tensorrt/archives/tensorrt-840-ea/support-matrix/index.html#hardware-precision-matrix).
* `use_fast`: Whether to use FastTokenizer, a high-performance tokenization operator implemented in C++, to accelerate text preprocessing. Install it first with `pip install fast-tokenizer-python`. Defaults to `False`. For more usage instructions, refer to the [FastTokenizer Documentation](../../fast_tokenizer).
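The `position_prob` filtering described above can be sketched as follows (a simplified illustration, not the library's internal implementation):

```python
def span_probability(start_prob, end_prob):
    # The final span probability is the product of the start- and
    # end-position probabilities
    return start_prob * end_prob

def keep_span(start_prob, end_prob, position_prob=0.5):
    # Spans whose combined probability falls below the threshold are dropped
    return span_probability(start_prob, end_prob) >= position_prob

print(keep_span(0.9, 0.8))  # 0.72 >= 0.5 -> True
print(keep_span(0.6, 0.7))  # 0.42 < 0.5 -> False
```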
We recommend fine-tuning the model with the [Trainer API](https://github.com/PaddlePaddle/PaddleNLP/blob/develop/docs/trainer.md). Simply provide the model, dataset, etc., and the Trainer API lets you run pretraining, fine-tuning, model compression and other tasks efficiently, with one-command multi-GPU training, mixed-precision training, gradient accumulation, checkpoint resumption, log display and more. The Trainer API also wraps common training configurations such as the optimizer and learning-rate scheduler.
This project provides an end-to-end application solution for plain text extraction based on UIE fine-tuning and goes through the full lifecycle of **data labeling, model training and model deployment**. We hope this guide can help you apply Information Extraction techniques in your own products or models.
<a name="2"></a>
## 2. Quick start
For quick start, you can directly use ```paddlenlp.Taskflow``` out-of-the-box, leveraging the zero-shot performance. For production use cases, we recommend labeling a small amount of data for model fine-tuning to further improve the performance.
<a name="21"></a>
### 2.1 Code structure
```shell
.
├── utils.py # data processing tools
├── finetune.py # model fine-tuning, compression script
├── evaluate.py # model evaluation script
└── README.md
```
<a name="22"></a>
### 2.2 Data labeling
We recommend using [Label Studio](https://labelstud.io/) for data labeling, and we provide an end-to-end pipeline from labeling to training: the [label_studio.py](../label_studio.py) script exports the labeled data from Label Studio and converts it into the input format required by the model. For a detailed introduction to labeling methods, please refer to the [Label Studio Data Labeling Guide](../label_studio_text_en.md).
Here we provide a pre-labeled example dataset, the military relationship extraction dataset, which you can download with the following command. We will show how to use the data conversion script to generate training/validation/test set files for fine-tuning.
Download the military relationship extraction dataset:
For more labeling rules and parameter descriptions for different types of tasks (including entity extraction, relationship extraction, document classification, etc.), please refer to [Label Studio Data Labeling Guide](../label_studio_text_en.md).
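A typical conversion invocation might look like the following (file paths and split ratios are illustrative; see the Label Studio guide for the full flag reference):

```shell
# Convert a Label Studio export into train/dev/test files for an
# extraction-type ("ext") task, split 80/10/10
python ../label_studio.py \
    --label_studio_file ./data/label_studio.json \
    --save_dir ./data \
    --splits 0.8 0.1 0.1 \
    --task_type ext
```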
<a name="23"></a>
### 2.3 Finetuning
Use the following command to fine-tune the model using `uie-base` as the pre-trained model, and save the fine-tuned model to `$finetuned_model`:
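A representative single-GPU invocation might look like this (parameter values are illustrative; adjust them for your dataset, and see the flag descriptions below):

```shell
# Fine-tune uie-base and export the best checkpoint to $finetuned_model
python finetune.py \
    --device gpu \
    --logging_steps 10 \
    --save_steps 100 \
    --eval_steps 100 \
    --seed 42 \
    --model_name_or_path uie-base \
    --output_dir $finetuned_model \
    --train_path data/train.txt \
    --dev_path data/dev.txt \
    --max_seq_len 512 \
    --per_device_train_batch_size 8 \
    --per_device_eval_batch_size 8 \
    --num_train_epochs 20 \
    --learning_rate 1e-5 \
    --do_train \
    --do_eval \
    --do_export \
    --export_model_dir $finetuned_model \
    --overwrite_output_dir \
    --disable_tqdm True \
    --metric_for_best_model eval_f1 \
    --load_best_model_at_end True \
    --save_total_limit 1
```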
* `device`: Training device; one of `cpu`, `gpu` and `npu` can be selected. Defaults to GPU training.
* `logging_steps`: The number of steps between log prints during training. The default is 10.
* `save_steps`: The number of steps between model checkpoint saves during training. The default is 100.
* `eval_steps`: The number of steps between evaluations during training. The default is 100.
* `seed`: Global random seed. The default is 42.
* `model_name_or_path`: The pre-trained model used for few-shot training. Defaults to `uie-x-base`.
* `output_dir`: Required. The directory where the model is saved after training or compression. The default is `None`.
* `train_path`: Training set path. Defaults to `None`.
* `dev_path`: Development set path. Defaults to `None`.
* `max_seq_len`: The maximum segmentation length of the text. Inputs exceeding the maximum length are automatically segmented. The default is 512.
* `per_device_train_batch_size`: Batch size per GPU core/NPU core/CPU for training. The default is 8.
* `per_device_eval_batch_size`: Batch size per GPU core/NPU core/CPU for evaluation. The default is 8.
* `num_train_epochs`: Number of training epochs; 100 is a reasonable choice when using early stopping. The default is 10.
* `learning_rate`: The maximum learning rate for training. For UIE-X we recommend 1e-5. The default is 3e-5.
* `label_names`: The names of the training data labels. For UIE-X, set to `start_positions` and `end_positions`. The default is `None`.
* `do_train`: Whether to run fine-tuning training. Setting this flag enables training; not set by default.
* `do_eval`: Whether to run evaluation. Setting this flag enables evaluation; not set by default.
* `do_export`: Whether to export a static graph model. Setting this flag enables export; not set by default.
* `export_model_dir`: The static graph export path. The default is `None`.
* `overwrite_output_dir`: If `True`, overwrite the contents of the output directory. If `output_dir` points to a checkpoint directory, use it to continue training.
* `disable_tqdm`: Whether to disable the tqdm progress bar.
* `metric_for_best_model`: The metric used to select the best model. For UIE-X we recommend `eval_f1`. The default is `None`.
* `load_best_model_at_end`: Whether to load the best model at the end of training, usually used together with `metric_for_best_model`. The default is `False`.
* `save_total_limit`: If set, limits the total number of checkpoints; older checkpoints in `output_dir` are removed. Defaults to `None`.
<a name="24"></a>
### 2.4 Evaluation
Model evaluation:
```shell
python evaluate.py \
--model_path ./checkpoint/model_best \
--test_path ./data/dev.txt \
--batch_size 16 \
--max_seq_len 512
```
Model evaluation for UIE-M:
```shell
python evaluate.py \
--model_path ./checkpoint/model_best \
--test_path ./data/dev.txt \
--batch_size 16 \
--max_seq_len 512 \
--multilingual
```
We adopt the single-stage method for evaluation, which means tasks that require multiple stages (e.g. relation extraction, event extraction) are evaluated separately for each stage. By default, the validation/test set uses all labels at the same level to construct the negative examples.
The `debug` mode can be turned on to evaluate each positive category separately. This mode is only used for model debugging:
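For example (flags as in the evaluation commands above, with `--debug` added):

```shell
# Evaluate each positive category separately -- for model debugging only
python evaluate.py \
    --model_path ./checkpoint/model_best \
    --test_path ./data/dev.txt \
    --debug
```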
- `device`: Evaluation device; one of `cpu`, `gpu` and `npu` can be selected. Defaults to GPU evaluation.
- `model_path`: The path of the model folder to evaluate, which must contain the model weight file `model_state.pdparams` and the configuration file `model_config.json`.
- `test_path`: The test set file for evaluation.
- `batch_size`: Batch size; adjust according to your machine. The default is 16.
- `max_seq_len`: The maximum segmentation length of the text. Inputs exceeding the maximum length are automatically segmented. The default is 512.
- `debug`: Whether to enable debug mode, which evaluates each positive category separately. This mode is only used for model debugging and is disabled by default.
- `multilingual`: Whether the model is multilingual. Disabled by default.
- `schema_lang`: The language of the schema; options are `ch` and `en`. The default is `ch`; select `en` for English datasets.
<a name="25"></a>
### 2.5 Inference
As with the pretrained models, you can use `paddlenlp.Taskflow` to load your custom model by specifying the path of the model weights through `task_path`.
```python
>>> from pprint import pprint
>>> from paddlenlp import Taskflow

>>> schema = {"武器名称": ["产国", "类型", "研发单位"]}
# Set the extraction target and the fine-tuned model path
>>> my_ie = Taskflow("information_extraction", schema=schema, task_path="./checkpoint/model_best")
```
Some industrial application scenarios have high inference performance requirements, and a model cannot go into production without effective compression. We built [UIE Slim Data Distillation](./data_distill/README_en.md) with knowledge distillation techniques. The principle is to use data as a bridge to transfer the knowledge of the UIE model to a smaller closed-domain information extraction model, achieving a significant inference speedup with minimal loss of accuracy.
```python
>>> my_ie = Taskflow("information_extraction", model="uie-data-distill-gp", task_path="checkpoint/model_best/")  # The schema is fixed in closed-domain information extraction
```