"tests/models/vscode:/vscode.git/clone" did not exist on "ce85686a1f425c8e60d9104522d8626395dd507d"
Commit 10f294ff authored by yuguo-Jack's avatar yuguo-Jack
Browse files

llama_paddle

parent 7c64e6ec
#!/bin/bash
# Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
export CUDA_VISIBLE_DEVICES=0
QUESTION=$1
if [ ! -d output ]; then
mkdir output
fi
if [ ! -d log ]; then
mkdir log
fi
python3 change_to_rerank.py ${QUESTION}
python3 -u ./src/train_ce.py \
--use_cuda true \
--verbose true \
--do_train false \
--do_val false \
--do_test true \
--batch_size 128 \
--init_checkpoint "./checkpoints/ranker" \
--test_set "./data/demo.tsv" \
--test_save "data/demo.score" \
--max_seq_len 384 \
--for_cn true \
--vocab_path "config/ernie_base_1.0_CN/vocab.txt" \
--ernie_config_path "config/ernie_base_1.0_CN/ernie_config.json"
1>>log/train.log 2>&1
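# Usage sketch (script name assumed; pass the question as the first argument):
#   sh run_rerank.sh "<question text>"
# Scores are written to data/demo.score and logs appended to log/train.log.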
#!/bin/bash
# Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
export CUDA_VISIBLE_DEVICES=0
if [ $# -ne 4 ]; then
echo "USAGE: sh run_train.sh \$TRAIN_SET \$MODEL_PATH \$epoch \$nodes_count"
exit 1
fi
TRAIN_SET=$1
MODEL_PATH=$2
epoch=$3
node=$4
CHECKPOINT_PATH=output
if [ ! -d output ]; then
mkdir output
fi
if [ ! -d log ]; then
mkdir log
fi
lr=1e-5
batch_size=32
train_examples=`cat $TRAIN_SET | wc -l`
save_steps=$((train_examples / batch_size / node))
data_size=$((save_steps * batch_size * node))
new_save_steps=$((save_steps * epoch / 2))
python3 -m paddle.distributed.launch \
--log_dir log \
./src/train_ce.py \
--use_cuda true \
--verbose true \
--do_train true \
--do_val false \
--do_test false \
--use_mix_precision false \
--train_data_size ${data_size} \
--batch_size ${batch_size} \
--init_pretraining_params ${MODEL_PATH} \
--train_set ${TRAIN_SET} \
--save_steps ${new_save_steps} \
--validation_steps ${new_save_steps} \
--checkpoints ${CHECKPOINT_PATH} \
--weight_decay 0.01 \
--warmup_proportion 0.0 \
--epoch $epoch \
--max_seq_len 384 \
--for_cn true \
--vocab_path config/ernie_base_1.0_CN/vocab.txt \
--ernie_config_path config/ernie_base_1.0_CN/ernie_config.json \
--learning_rate ${lr} \
--skip_steps 10 \
--num_iteration_per_drop_scope 1 \
--num_labels 2 \
--random_seed 1
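# Usage sketch, per the USAGE string above (paths illustrative):
#   sh run_train.sh data/train.tsv checkpoints/ernie_base 3 1
# save_steps is derived from the data size so that a checkpoint is saved
# roughly twice over the whole run (every epoch/2 epochs).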
# Copyright (c) 2019 PaddlePaddle Authors. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
"""Mask, padding and batching."""
import numpy as np
def pad_batch_data(
insts,
pad_idx=0,
return_pos=False,
return_input_mask=False,
return_max_len=False,
return_num_token=False,
return_seq_lens=False,
):
"""
Pad the instances to the max sequence length in batch, and generate the
corresponding position data and attention bias.
"""
return_list = []
max_len = max(len(inst) for inst in insts)
# Any token in the vocabulary can be used for padding, since the padding
# loss is masked out by weights and has no effect on parameter gradients.
inst_data = np.array([inst + list([pad_idx] * (max_len - len(inst))) for inst in insts])
return_list += [inst_data.astype("int64").reshape([-1, max_len, 1])]
# position data
if return_pos:
inst_pos = np.array([list(range(0, len(inst))) + [pad_idx] * (max_len - len(inst)) for inst in insts])
return_list += [inst_pos.astype("int64").reshape([-1, max_len, 1])]
if return_input_mask:
# This is used to avoid attention on paddings.
input_mask_data = np.array([[1] * len(inst) + [0] * (max_len - len(inst)) for inst in insts])
input_mask_data = np.expand_dims(input_mask_data, axis=-1)
return_list += [input_mask_data.astype("float32")]
if return_max_len:
return_list += [max_len]
if return_num_token:
num_token = 0
for inst in insts:
num_token += len(inst)
return_list += [num_token]
if return_seq_lens:
seq_lens = np.array([len(inst) for inst in insts])
return_list += [seq_lens.astype("int64").reshape([-1, 1])]
return return_list if len(return_list) > 1 else return_list[0]
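# A minimal usage sketch (illustrative, assuming 0 is the pad id):
#
#   ids, mask = pad_batch_data([[5, 6, 7], [8, 9]], pad_idx=0,
#                              return_input_mask=True)
#   # ids:  int64 array of shape (2, 3, 1); the second row is padded with 0
#   # mask: float32 array of shape (2, 3, 1); 1.0 on real tokens, 0.0 on pads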
if __name__ == "__main__":
pass
# Copyright (c) 2019 PaddlePaddle Authors. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
"""Model for classifier."""
import logging
import time
import numpy as np
import paddle.fluid as fluid
from model.ernie import ErnieModel
from scipy.stats import pearsonr, spearmanr
log = logging.getLogger(__name__)
def create_model(args, pyreader_name, ernie_config, is_prediction=False, task_name=""):
pyreader = fluid.layers.py_reader(
capacity=50,
shapes=[
[-1, args.max_seq_len, 1],
[-1, args.max_seq_len, 1],
[-1, args.max_seq_len, 1],
[-1, args.max_seq_len, 1],
[-1, args.max_seq_len, 1],
[-1, 1],
[-1, 1],
],
dtypes=["int64", "int64", "int64", "int64", "float32", "int64", "int64"],
lod_levels=[0, 0, 0, 0, 0, 0, 0],
name=task_name + "_" + pyreader_name,
use_double_buffer=True,
)
(src_ids, sent_ids, pos_ids, task_ids, input_mask, labels, qids) = fluid.layers.read_file(pyreader)
def _model(is_noise=False):
ernie = ErnieModel(
src_ids=src_ids,
position_ids=pos_ids,
sentence_ids=sent_ids,
task_ids=task_ids,
input_mask=input_mask,
config=ernie_config,
is_noise=is_noise,
)
cls_feats = ernie.get_pooled_output()
if not is_noise:
cls_feats = fluid.layers.dropout(x=cls_feats, dropout_prob=0.1, dropout_implementation="upscale_in_train")
logits = fluid.layers.fc(
input=cls_feats,
size=args.num_labels,
param_attr=fluid.ParamAttr(
name=task_name + "_cls_out_w", initializer=fluid.initializer.TruncatedNormal(scale=0.02)
),
bias_attr=fluid.ParamAttr(name=task_name + "_cls_out_b", initializer=fluid.initializer.Constant(0.0)),
)
"""
if is_prediction:
probs = fluid.layers.softmax(logits)
feed_targets_name = [
src_ids.name, sent_ids.name, pos_ids.name, input_mask.name
]
if ernie_version == "2.0":
feed_targets_name += [task_ids.name]
return pyreader, probs, feed_targets_name
"""
num_seqs = fluid.layers.create_tensor(dtype="int64")
# add focal loss
ce_loss, probs = fluid.layers.softmax_with_cross_entropy(logits=logits, label=labels, return_softmax=True)
loss = fluid.layers.mean(x=ce_loss)
accuracy = fluid.layers.accuracy(input=probs, label=labels, total=num_seqs)
graph_vars = {
"loss": loss,
"probs": probs,
"accuracy": accuracy,
"labels": labels,
"num_seqs": num_seqs,
"qids": qids,
}
return graph_vars
if not is_prediction:
graph_vars = _model(is_noise=True)
old_loss = graph_vars["loss"]
token_emb = fluid.default_main_program().global_block().var("word_embedding")
token_emb.stop_gradient = False
token_gradient = fluid.gradients(old_loss, token_emb)[0]
token_gradient.stop_gradient = False
epsilon = 1e-8
norm = fluid.layers.sqrt(fluid.layers.reduce_sum(fluid.layers.square(token_gradient)) + epsilon)
gp = (0.01 * token_gradient) / norm
gp.stop_gradient = True
fluid.layers.assign(token_emb + gp, token_emb)
graph_vars = _model()
fluid.layers.assign(token_emb - gp, token_emb)
else:
graph_vars = _model()
return pyreader, graph_vars
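# Note on the training branch above: create_model runs a simple adversarial
# pass. It first builds a noise-free graph, perturbs the word-embedding table
# along the normalized gradient of that loss (gp = 0.01 * grad / ||grad||),
# rebuilds the training graph on the perturbed embeddings, and finally
# restores the original table.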
def evaluate_mrr(preds):
last_qid = None
total_mrr = 0.0
qnum = 0.0
rank = 0.0
correct = False
for qid, score, label in preds:
if qid != last_qid:
rank = 0.0
qnum += 1
correct = False
last_qid = qid
rank += 1
if not correct and label != 0:
total_mrr += 1.0 / rank
correct = True
return total_mrr / qnum
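# Illustrative: for preds sorted by (qid, -score) such as
#   [("q1", 0.9, 0), ("q1", 0.7, 1), ("q2", 0.8, 1)]
# the first relevant hit for q1 is at rank 2 and for q2 at rank 1, so
# evaluate_mrr returns (1/2 + 1/1) / 2 = 0.75.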
def evaluate(
exe, test_program, test_pyreader, graph_vars, eval_phase, use_multi_gpu_test=False, metric="simple_accuracy"
):
train_fetch_list = [graph_vars["loss"].name, graph_vars["accuracy"].name, graph_vars["num_seqs"].name]
if eval_phase == "train":
if "learning_rate" in graph_vars:
train_fetch_list.append(graph_vars["learning_rate"].name)
outputs = exe.run(fetch_list=train_fetch_list, program=test_program)
ret = {"loss": np.mean(outputs[0]), "accuracy": np.mean(outputs[1])}
if "learning_rate" in graph_vars:
ret["learning_rate"] = float(outputs[3][0])
return ret
test_pyreader.start()
total_cost = 0.0
total_acc = 0.0
total_num_seqs = 0.0
total_label_pos_num = 0.0
total_pred_pos_num = 0.0
total_correct_num = 0.0
qids, labels, scores, preds = [], [], [], []
time_begin = time.time()
fetch_list = [
graph_vars["loss"].name,
graph_vars["accuracy"].name,
graph_vars["probs"].name,
graph_vars["labels"].name,
graph_vars["num_seqs"].name,
graph_vars["qids"].name,
]
while True:
try:
if use_multi_gpu_test:
np_loss, np_acc, np_probs, np_labels, np_num_seqs, np_qids = exe.run(fetch_list=fetch_list)
else:
np_loss, np_acc, np_probs, np_labels, np_num_seqs, np_qids = exe.run(
program=test_program, fetch_list=fetch_list
)
total_cost += np.sum(np_loss * np_num_seqs)
total_acc += np.sum(np_acc * np_num_seqs)
total_num_seqs += np.sum(np_num_seqs)
labels.extend(np_labels.reshape((-1)).tolist())
if np_qids is None:
np_qids = np.array([])
qids.extend(np_qids.reshape(-1).tolist())
scores.extend(np_probs[:, 1].reshape(-1).tolist())
np_preds = np.argmax(np_probs, axis=1).astype(np.float32)
preds.extend(np_preds)
total_label_pos_num += np.sum(np_labels)
total_pred_pos_num += np.sum(np_preds)
total_correct_num += np.sum(np.dot(np_preds, np_labels))
except fluid.core.EOFException:
test_pyreader.reset()
break
time_end = time.time()
cost = total_cost / total_num_seqs
elapsed_time = time_end - time_begin
evaluate_info = ""
if metric == "acc_and_f1":
ret = acc_and_f1(preds, labels)
evaluate_info = "[%s evaluation] ave loss: %f, ave_acc: %f, f1: %f, data_num: %d, elapsed time: %f s" % (
eval_phase,
cost,
ret["acc"],
ret["f1"],
total_num_seqs,
elapsed_time,
)
elif metric == "matthews_corrcoef":
ret = matthews_corrcoef(preds, labels)
evaluate_info = "[%s evaluation] ave loss: %f, matthews_corrcoef: %f, data_num: %d, elapsed time: %f s" % (
eval_phase,
cost,
ret,
total_num_seqs,
elapsed_time,
)
elif metric == "pearson_and_spearman":
ret = pearson_and_spearman(scores, labels)
evaluate_info = (
"[%s evaluation] ave loss: %f, pearson:%f, spearman:%f, corr:%f, data_num: %d, elapsed time: %f s"
% (eval_phase, cost, ret["pearson"], ret["spearman"], ret["corr"], total_num_seqs, elapsed_time)
)
elif metric == "simple_accuracy":
ret = simple_accuracy(preds, labels)
evaluate_info = "[%s evaluation] ave loss: %f, acc:%f, data_num: %d, elapsed time: %f s" % (
eval_phase,
cost,
ret,
total_num_seqs,
elapsed_time,
)
elif metric == "acc_and_f1_and_mrr":
ret_a = acc_and_f1(preds, labels)
preds = sorted(zip(qids, scores, labels), key=lambda elem: (elem[0], -elem[1]))
ret_b = evaluate_mrr(preds)
evaluate_info = "[%s evaluation] ave loss: %f, acc: %f, f1: %f, mrr: %f, data_num: %d, elapsed time: %f s" % (
eval_phase,
cost,
ret_a["acc"],
ret_a["f1"],
ret_b,
total_num_seqs,
elapsed_time,
)
else:
raise ValueError("unsupported metric {}".format(metric))
return evaluate_info
def matthews_corrcoef(preds, labels):
preds = np.array(preds)
labels = np.array(labels)
tp = np.sum((labels == 1) & (preds == 1))
tn = np.sum((labels == 0) & (preds == 0))
fp = np.sum((labels == 0) & (preds == 1))
fn = np.sum((labels == 1) & (preds == 0))
denom = np.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
mcc = ((tp * tn) - (fp * fn)) / denom if denom > 0 else 0.0
return mcc
def f1_score(preds, labels):
preds = np.array(preds)
labels = np.array(labels)
tp = np.sum((labels == 1) & (preds == 1))
fp = np.sum((labels == 0) & (preds == 1))
fn = np.sum((labels == 1) & (preds == 0))
p = tp / (tp + fp + 1e-8)
r = tp / (tp + fn + 1e-8)
f1 = (2 * p * r) / (p + r + 1e-8)
return f1
def pearson_and_spearman(preds, labels):
preds = np.array(preds)
labels = np.array(labels)
pearson_corr = pearsonr(preds, labels)[0]
spearman_corr = spearmanr(preds, labels)[0]
return {
"pearson": pearson_corr,
"spearmanr": spearman_corr,
"corr": (pearson_corr + spearman_corr) / 2,
}
def acc_and_f1(preds, labels):
preds = np.array(preds)
labels = np.array(labels)
acc = simple_accuracy(preds, labels)
f1 = f1_score(preds, labels)
return {
"acc": acc,
"f1": f1,
"acc_and_f1": (acc + f1) / 2,
}
def simple_accuracy(preds, labels):
preds = np.array(preds)
labels = np.array(labels)
return (preds == labels).mean()
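# Illustrative: simple_accuracy([1, 0, 1], [1, 1, 1]) == 2/3, while
# acc_and_f1 on the same inputs also reports the binary F1 and their mean.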
def predict(exe, test_program, test_pyreader, graph_vars, dev_count=1):
test_pyreader.start()
qids, probs = [], []
preds = []
fetch_list = [graph_vars["probs"].name, graph_vars["qids"].name]
while True:
try:
if dev_count == 1:
np_probs, np_qids = exe.run(program=test_program, fetch_list=fetch_list)
else:
np_probs, np_qids = exe.run(fetch_list=fetch_list)
if np_qids is None:
np_qids = np.array([])
qids.extend(np_qids.reshape(-1).tolist())
np_preds = np.argmax(np_probs, axis=1).astype(np.float32)
preds.extend(np_preds)
probs.append(np_probs)
except fluid.core.EOFException:
test_pyreader.reset()
break
probs = np.concatenate(probs, axis=0).reshape([len(preds), -1])
return qids, preds, probs
# Copyright (c) 2023 PaddlePaddle Authors. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
import argparse
from src.utils.args import ArgumentGroup
# yapf: disable
parser = argparse.ArgumentParser(__doc__)
model_g = ArgumentGroup(parser, "model", "model configuration and paths.")
model_g.add_arg("ernie_config_path", str, None, "Path to the json file for ernie model config.")
model_g.add_arg("init_checkpoint", str, None, "Init checkpoint to resume training from.")
model_g.add_arg("init_pretraining_params", str, None, "Init pre-training params which preforms fine-tuning from. If the arg 'init_checkpoint' has been set, this argument wouldn't be valid.")
model_g.add_arg("checkpoints", str, "checkpoints", "Path to save checkpoints.")
model_g.add_arg("is_classify", bool, True, "is_classify")
model_g.add_arg("is_regression", bool, False, "is_regression")
model_g.add_arg("task_id", int, 0, "task id")
train_g = ArgumentGroup(parser, "training", "training options.")
train_g.add_arg("epoch", int, 3, "Number of epoches for fine-tuning.")
train_g.add_arg("learning_rate", float, 5e-5, "Learning rate used to train with warmup.")
train_g.add_arg("lr_scheduler", str, "linear_warmup_decay", "scheduler of learning rate.", choices=['linear_warmup_decay', 'noam_decay'])
train_g.add_arg("weight_decay", float, 0.01, "Weight decay rate for L2 regularizer.")
train_g.add_arg("warmup_proportion", float, 0.1, "Proportion of training steps to perform linear learning rate warmup for.")
train_g.add_arg("save_steps", int, 10000, "The steps interval to save checkpoints.")
train_g.add_arg("validation_steps", int, 1000, "The steps interval to evaluate model performance.")
train_g.add_arg("use_recompute", bool, False, "Whether to use recompute optimizer for training.")
train_g.add_arg("use_mix_precision", bool, False, "Whether to use mix-precision optimizer for training.")
train_g.add_arg("use_cross_batch", bool, False, "Whether to use cross-batch for training.")
train_g.add_arg("use_lamb", bool, False, "Whether to use LambOptimizer for training.")
train_g.add_arg("use_dynamic_loss_scaling", bool, True, "Whether to use dynamic loss scaling.")
train_g.add_arg("test_save", str, "./checkpoints/test_result", "test_save")
train_g.add_arg("metric", str, "simple_accuracy", "metric")
train_g.add_arg("incr_every_n_steps", int, 100, "Increases loss scaling every n consecutive.")
train_g.add_arg("decr_every_n_nan_or_inf", int, 2, "Decreases loss scaling every n accumulated steps with nan or inf gradients.")
train_g.add_arg("incr_ratio", float, 2.0, "The multiplier to use when increasing the loss scaling.")
train_g.add_arg("decr_ratio", float, 0.8, "The less-than-one-multiplier to use when decreasing.")
log_g = ArgumentGroup(parser, "logging", "logging related.")
log_g.add_arg("skip_steps", int, 10, "The steps interval to print loss.")
log_g.add_arg("verbose", bool, False, "Whether to output verbose log.")
data_g = ArgumentGroup(parser, "data", "Data paths, vocab paths and data processing options")
data_g.add_arg("tokenizer", str, "FullTokenizer", "ATTENTION: the INPUT must be splited by Word with blank while using SentencepieceTokenizer or WordsegTokenizer")
data_g.add_arg("train_set", str, None, "Path to training data.")
data_g.add_arg("test_set", str, None, "Path to test data.")
data_g.add_arg("dev_set", str, None, "Path to validation data.")
data_g.add_arg("vocab_path", str, None, "Vocabulary path.")
data_g.add_arg("max_seq_len", int, 512, "Number of words of the longest seqence.")
data_g.add_arg("q_max_seq_len", int, 32, "Number of words of the longest seqence.")
data_g.add_arg("p_max_seq_len", int, 256, "Number of words of the longest seqence.")
data_g.add_arg("train_data_size", int, 0, "Number of training data's total examples. Set for distribute.")
data_g.add_arg("batch_size", int, 32, "Total examples' number in batch for training. see also --in_tokens.")
data_g.add_arg("predict_batch_size", int, None, "Total examples' number in batch for predict. see also --in_tokens.")
data_g.add_arg("in_tokens", bool, False, "If set, the batch size will be the maximum number of tokens in one batch. Otherwise, it will be the maximum number of examples in one batch.")
data_g.add_arg("do_lower_case", bool, True, "Whether to lower case the input text. Should be True for uncased models and False for cased models.")
data_g.add_arg("random_seed", int, None, "Random seed.")
data_g.add_arg("label_map_config", str, None, "label_map_path.")
data_g.add_arg("num_labels", int, 2, "label number")
data_g.add_arg("diagnostic", str, None, "GLUE Diagnostic Dataset")
data_g.add_arg("diagnostic_save", str, None, "GLUE Diagnostic save f")
data_g.add_arg("max_query_length", int, 64, "Max query length.")
data_g.add_arg("max_answer_length", int, 100, "Max answer length.")
data_g.add_arg("doc_stride", int, 128, "When splitting up a long document into chunks, how much stride to take between chunks.")
data_g.add_arg("n_best_size", int, 20, "The total number of n-best predictions to generate in the nbest_predictions.json output file.")
data_g.add_arg("chunk_scheme", type=str, default="IOB", choices=["IO", "IOB", "IOE", "IOBES"], help="chunk scheme")
run_type_g = ArgumentGroup(parser, "run_type", "running type options.")
run_type_g.add_arg("use_cuda", bool, True, "If set, use GPU for training.")
run_type_g.add_arg("is_distributed", bool, False, "If set, then start distributed training.")
run_type_g.add_arg("use_fast_executor", bool, False, "If set, use fast parallel executor (in experiment).")
run_type_g.add_arg("num_iteration_per_drop_scope", int, 10, "Iteration intervals to drop scope.")
run_type_g.add_arg("do_train", bool, True, "Whether to perform training.")
run_type_g.add_arg("do_val", bool, True, "Whether to perform evaluation on dev data set.")
run_type_g.add_arg("do_test", bool, True, "Whether to perform evaluation on test data set.")
run_type_g.add_arg("output_item", int, 3, "Test output format.")
run_type_g.add_arg("output_file_name", str, None, "Test output file name")
run_type_g.add_arg("test_data_cnt", int, 1110000 , "total cnt of testset")
run_type_g.add_arg("use_multi_gpu_test", bool, False, "Whether to perform evaluation using multiple gpu cards")
run_type_g.add_arg("metrics", bool, True, "Whether to perform evaluation on test data set.")
run_type_g.add_arg("shuffle", bool, True, "")
run_type_g.add_arg("for_cn", bool, False, "model train for cn or for other langs.")
# Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
import sys
import faiss
import numpy as np
def read_embed(file_name, dim=768, bs=3000):
if file_name.endswith("npy"):
i = 0
emb_np = np.load(file_name)
while i < len(emb_np):
vec_list = emb_np[i : i + bs]
i += bs
yield vec_list
else:
vec_list = []
with open(file_name) as inp:
for line in inp:
data = line.strip()
vector = [float(item) for item in data.split(" ")]
assert len(vector) == dim
vec_list.append(vector)
if len(vec_list) == bs:
yield vec_list
vec_list = []
if vec_list:
yield vec_list
def load_qid(file_name):
qid_list = []
with open(file_name) as inp:
for line in inp:
line = line.strip()
qid = line.split("\t")[0]
qid_list.append(qid)
return qid_list
def search(index, emb_file, qid_list, outfile, top_k):
q_idx = 0
with open(outfile, "w") as out:
for batch_vec in read_embed(emb_file):
q_emb_matrix = np.array(batch_vec)
res_dist, res_p_id = index.search(q_emb_matrix.astype("float32"), top_k)
for i in range(len(q_emb_matrix)):
qid = qid_list[q_idx]
for j in range(top_k):
pid = res_p_id[i][j]
score = res_dist[i][j]
out.write("%s\t%s\t%s\t%s\n" % (qid, pid, j + 1, score))
q_idx += 1
def main():
part = sys.argv[1]
topk = int(sys.argv[2])
q_text_file = sys.argv[3]
outfile = "output/res.top%s-part%s" % (topk, part)
qid_list = load_qid(q_text_file)
engine = faiss.read_index("output/para.index.part%s" % part)
emb_file = "output/query.emb.npy"
search(engine, emb_file, qid_list, outfile, topk)
if __name__ == "__main__":
main()
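# Usage sketch (script name assumed), retrieving the top-50 passages for
# index part 0 given a TSV whose first column is the query id:
#   python3 search.py 0 50 data/dev.query.tsv
# Results go to output/res.top50-part0 as "qid \t pid \t rank \t score" lines.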
# Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
import sys
shift = int(sys.argv[1])
top = int(sys.argv[2])
total_part = int(sys.argv[3])
f_list = []
for part in range(total_part):
f0 = open("output/res.top%s-part%s" % (top, part))
f_list.append(f0)
line_list = []
for part in range(total_part):
line = f_list[part].readline()
line_list.append(line)
out = open("output/dev.res.top%s" % top, "w")
last_q = ""
ans_list = {}
while line_list[-1]:
cur_list = []
for line in line_list:
sub = line.strip().split("\t")
cur_list.append(sub)
if last_q == "":
last_q = cur_list[0][0]
if cur_list[0][0] != last_q:
rank = sorted(ans_list.items(), key=lambda a: a[1], reverse=True)
for i in range(top):
out.write("%s\t%s\t%s\t%s\n" % (last_q, rank[i][0], i + 1, rank[i][1]))
ans_list = {}
for i, sub in enumerate(cur_list):
ans_list[int(sub[1]) + shift * i] = float(sub[-1])
last_q = cur_list[0][0]
line_list = []
for f0 in f_list:
line = f0.readline()
line_list.append(line)
rank = sorted(ans_list.items(), key=lambda a: a[1], reverse=True)
for i in range(top):
out.write("%s\t%s\t%s\t%s\n" % (last_q, rank[i][0], i + 1, rank[i][1]))
out.close()
print("output/dev.res.top%s" % top)
# Copyright (c) 2019 PaddlePaddle Authors. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
"""Ernie model."""
from __future__ import absolute_import, division, print_function, unicode_literals
import json
import logging
from io import open
import paddle
import paddle.fluid as fluid
import six
from model.transformer_encoder import encoder, pre_process_layer
log = logging.getLogger(__name__)
class ErnieConfig(object):
def __init__(self, config_path):
self._config_dict = self._parse(config_path)
def _parse(self, config_path):
try:
with open(config_path, "r", encoding="utf8") as json_file:
config_dict = json.load(json_file)
except Exception:
raise IOError("Error in parsing Ernie model config file '%s'" % config_path)
else:
return config_dict
def __getitem__(self, key):
return self._config_dict.get(key, None)
def print_config(self):
for arg, value in sorted(six.iteritems(self._config_dict)):
log.info("%s: %s" % (arg, value))
log.info("------------------------------------------------")
class ErnieModel(object):
def __init__(
self,
src_ids,
position_ids,
sentence_ids,
task_ids,
input_mask,
config,
weight_sharing=True,
model_name="",
is_noise=False,
):
self._emb_size = config["hidden_size"]
self._n_layer = config["num_hidden_layers"]
self._n_head = config["num_attention_heads"]
self._voc_size = config["vocab_size"]
self._max_position_seq_len = config["max_position_embeddings"]
if config["sent_type_vocab_size"]:
self._sent_types = config["sent_type_vocab_size"]
else:
self._sent_types = config["type_vocab_size"]
self._use_task_id = config["use_task_id"]
if self._use_task_id:
self._task_types = config["task_type_vocab_size"]
self._hidden_act = config["hidden_act"]
self._prepostprocess_dropout = config["hidden_dropout_prob"]
self._attention_dropout = config["attention_probs_dropout_prob"]
if is_noise:
self._prepostprocess_dropout = 0
self._attention_dropout = 0
self._weight_sharing = weight_sharing
self.checkpoints = []
self._word_emb_name = "word_embedding"
self._pos_emb_name = "pos_embedding"
self._sent_emb_name = "sent_embedding"
self._task_emb_name = "task_embedding"
self._emb_dtype = "float32"
# Initialize all weights by truncated normal initializer, and all biases
# will be initialized by constant zero by default.
self._param_initializer = fluid.initializer.TruncatedNormal(scale=config["initializer_range"])
self._build_model(model_name, src_ids, position_ids, sentence_ids, task_ids, input_mask)
def _build_model(self, model_name, src_ids, position_ids, sentence_ids, task_ids, input_mask):
# padding id in vocabulary must be set to 0
emb_out = fluid.layers.embedding(
input=src_ids,
size=[self._voc_size, self._emb_size],
dtype=self._emb_dtype,
param_attr=fluid.ParamAttr(name=model_name + self._word_emb_name, initializer=self._param_initializer),
is_sparse=False,
)
position_emb_out = fluid.layers.embedding(
input=position_ids,
size=[self._max_position_seq_len, self._emb_size],
dtype=self._emb_dtype,
param_attr=fluid.ParamAttr(name=model_name + self._pos_emb_name, initializer=self._param_initializer),
)
sent_emb_out = fluid.layers.embedding(
sentence_ids,
size=[self._sent_types, self._emb_size],
dtype=self._emb_dtype,
param_attr=fluid.ParamAttr(name=model_name + self._sent_emb_name, initializer=self._param_initializer),
)
emb_out = emb_out + position_emb_out
emb_out = emb_out + sent_emb_out
if self._use_task_id:
task_emb_out = fluid.layers.embedding(
task_ids,
size=[self._task_types, self._emb_size],
dtype=self._emb_dtype,
param_attr=fluid.ParamAttr(name=model_name + self._task_emb_name, initializer=self._param_initializer),
)
emb_out = emb_out + task_emb_out
emb_out = pre_process_layer(emb_out, "nd", self._prepostprocess_dropout, name=model_name + "pre_encoder")
self_attn_mask = paddle.matmul(x=input_mask, y=input_mask, transpose_y=True)
self_attn_mask = fluid.layers.scale(x=self_attn_mask, scale=10000.0, bias=-1.0, bias_after_scale=False)
n_head_self_attn_mask = fluid.layers.stack(x=[self_attn_mask] * self._n_head, axis=1)
n_head_self_attn_mask.stop_gradient = True
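# The bias above is 0.0 wherever both positions are real tokens and -10000.0
# wherever either position is padding ((mask . mask^T - 1) * 10000), so the
# softmax weight on padded positions effectively vanishes.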
self._enc_out, self.checkpoints = encoder(
enc_input=emb_out,
attn_bias=n_head_self_attn_mask,
n_layer=self._n_layer,
n_head=self._n_head,
d_key=self._emb_size // self._n_head,
d_value=self._emb_size // self._n_head,
d_model=self._emb_size,
d_inner_hid=self._emb_size * 4,
prepostprocess_dropout=self._prepostprocess_dropout,
attention_dropout=self._attention_dropout,
relu_dropout=0,
hidden_act=self._hidden_act,
preprocess_cmd="",
postprocess_cmd="dan",
param_initializer=self._param_initializer,
model_name=model_name,
name=model_name + "encoder",
)
def get_sequence_output(self):
return self._enc_out
def get_cls_output(self):
"""Get the first feature of each sequence for classification"""
cls_output = fluid.layers.slice(input=self._enc_out, axes=[1], starts=[0], ends=[1])
cls_output = fluid.layers.squeeze(cls_output, axes=[1])
return cls_output
def get_pooled_output(self):
"""Get the first feature of each sequence for classification"""
next_sent_feat = fluid.layers.slice(input=self._enc_out, axes=[1], starts=[0], ends=[1])
next_sent_feat = fluid.layers.fc(
input=next_sent_feat,
size=self._emb_size,
act="tanh",
param_attr=fluid.ParamAttr(name="pooled_fc.w_0", initializer=self._param_initializer),
bias_attr="pooled_fc.b_0",
)
return next_sent_feat
def get_lm_output(self, mask_label, mask_pos):
"""Get the loss & accuracy for pretraining"""
mask_pos = fluid.layers.cast(x=mask_pos, dtype="int32")
# extract the first token feature in each sentence
self.next_sent_feat = self.get_pooled_output()
reshaped_emb_out = fluid.layers.reshape(x=self._enc_out, shape=[-1, self._emb_size])
# extract masked tokens' feature
mask_feat = fluid.layers.gather(input=reshaped_emb_out, index=mask_pos)
# transform: fc
mask_trans_feat = fluid.layers.fc(
input=mask_feat,
size=self._emb_size,
act=self._hidden_act,
param_attr=fluid.ParamAttr(name="mask_lm_trans_fc.w_0", initializer=self._param_initializer),
bias_attr=fluid.ParamAttr(name="mask_lm_trans_fc.b_0"),
)
# transform: layer norm
mask_trans_feat = fluid.layers.layer_norm(
mask_trans_feat,
begin_norm_axis=len(mask_trans_feat.shape) - 1,
param_attr=fluid.ParamAttr(
name="mask_lm_trans_layer_norm_scale", initializer=fluid.initializer.Constant(1.0)
),
bias_attr=fluid.ParamAttr(
name="mask_lm_trans_layer_norm_bias", initializer=fluid.initializer.Constant(1.0)
),
)
# transform: layer norm
# mask_trans_feat = pre_process_layer(
# mask_trans_feat, 'n', name='mask_lm_trans')
mask_lm_out_bias_attr = fluid.ParamAttr(
name="mask_lm_out_fc.b_0", initializer=fluid.initializer.Constant(value=0.0)
)
if self._weight_sharing:
fc_out = paddle.matmul(
x=mask_trans_feat,
y=fluid.default_main_program().global_block().var(self._word_emb_name),
transpose_y=True,
)
fc_out += fluid.layers.create_parameter(
shape=[self._voc_size], dtype=self._emb_dtype, attr=mask_lm_out_bias_attr, is_bias=True
)
else:
fc_out = fluid.layers.fc(
input=mask_trans_feat,
size=self._voc_size,
param_attr=fluid.ParamAttr(name="mask_lm_out_fc.w_0", initializer=self._param_initializer),
bias_attr=mask_lm_out_bias_attr,
)
mask_lm_loss = fluid.layers.softmax_with_cross_entropy(logits=fc_out, label=mask_label)
mean_mask_lm_loss = fluid.layers.mean(mask_lm_loss)
return mean_mask_lm_loss
def get_task_output(self, task, task_labels):
task_fc_out = fluid.layers.fc(
input=self.next_sent_feat,
size=task["num_labels"],
param_attr=fluid.ParamAttr(name=task["task_name"] + "_fc.w_0", initializer=self._param_initializer),
bias_attr=task["task_name"] + "_fc.b_0",
)
task_loss, task_softmax = fluid.layers.softmax_with_cross_entropy(
logits=task_fc_out, label=task_labels, return_softmax=True
)
task_acc = fluid.layers.accuracy(input=task_softmax, label=task_labels)
mean_task_loss = fluid.layers.mean(task_loss)
return mean_task_loss, task_acc
# Copyright (c) 2019 PaddlePaddle Authors. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
"""Transformer encoder."""
from __future__ import absolute_import, division, print_function
from functools import partial
import paddle
import paddle.fluid as fluid
import paddle.fluid.layers as layers
def multi_head_attention(
queries,
keys,
values,
attn_bias,
d_key,
d_value,
d_model,
n_head=1,
dropout_rate=0.0,
cache=None,
param_initializer=None,
name="multi_head_att",
):
"""
Multi-Head Attention. Note that attn_bias is added to the logits before
computing the softmax activation, to mask selected positions so that
they will not be considered in the attention weights.
"""
keys = queries if keys is None else keys
values = keys if values is None else values
if not (len(queries.shape) == len(keys.shape) == len(values.shape) == 3):
raise ValueError("Inputs: queries, keys and values should all be 3-D tensors.")
def __compute_qkv(queries, keys, values, n_head, d_key, d_value):
"""
Add linear projection to queries, keys, and values.
"""
q = layers.fc(
input=queries,
size=d_key * n_head,
num_flatten_dims=2,
param_attr=fluid.ParamAttr(name=name + "_query_fc.w_0", initializer=param_initializer),
bias_attr=name + "_query_fc.b_0",
)
k = layers.fc(
input=keys,
size=d_key * n_head,
num_flatten_dims=2,
param_attr=fluid.ParamAttr(name=name + "_key_fc.w_0", initializer=param_initializer),
bias_attr=name + "_key_fc.b_0",
)
v = layers.fc(
input=values,
size=d_value * n_head,
num_flatten_dims=2,
param_attr=fluid.ParamAttr(name=name + "_value_fc.w_0", initializer=param_initializer),
bias_attr=name + "_value_fc.b_0",
)
return q, k, v
def __split_heads(x, n_head):
"""
Reshape the last dimension of input tensor x so that it becomes two
dimensions and then transpose. Specifically, input a tensor with shape
[bs, max_sequence_length, n_head * hidden_dim] then output a tensor
with shape [bs, n_head, max_sequence_length, hidden_dim].
"""
hidden_size = x.shape[-1]
# The value 0 in shape attr means copying the corresponding dimension
# size of the input as the output dimension size.
reshaped = layers.reshape(x=x, shape=[0, 0, n_head, hidden_size // n_head], inplace=True)
# permute the dimensions into:
# [batch_size, n_head, max_sequence_len, hidden_size_per_head]
return layers.transpose(x=reshaped, perm=[0, 2, 1, 3])
def __combine_heads(x):
"""
Transpose and then reshape the last two dimensions of input tensor x
so that it becomes one dimension, which is reverse to __split_heads.
"""
if len(x.shape) == 3:
return x
if len(x.shape) != 4:
raise ValueError("Input(x) should be a 4-D Tensor.")
trans_x = layers.transpose(x, perm=[0, 2, 1, 3])
# The value 0 in shape attr means copying the corresponding dimension
# size of the input as the output dimension size.
return layers.reshape(x=trans_x, shape=[0, 0, trans_x.shape[2] * trans_x.shape[3]], inplace=True)
def scaled_dot_product_attention(q, k, v, attn_bias, d_key, dropout_rate):
"""
Scaled Dot-Product Attention
"""
scaled_q = layers.scale(x=q, scale=d_key**-0.5)
product = paddle.matmul(x=scaled_q, y=k, transpose_y=True)
if attn_bias:
product += attn_bias
weights = layers.softmax(product)
if dropout_rate:
weights = layers.dropout(
weights, dropout_prob=dropout_rate, dropout_implementation="upscale_in_train", is_test=False
)
out = paddle.matmul(weights, v)
return out
q, k, v = __compute_qkv(queries, keys, values, n_head, d_key, d_value)
if cache is not None: # use cache and concat time steps
# Since the inplace reshape in __split_heads changes the shape of k and
# v, which is the cache input for next time step, reshape the cache
# input from the previous time step first.
k = cache["k"] = layers.concat([layers.reshape(cache["k"], shape=[0, 0, d_model]), k], axis=1)
v = cache["v"] = layers.concat([layers.reshape(cache["v"], shape=[0, 0, d_model]), v], axis=1)
q = __split_heads(q, n_head)
k = __split_heads(k, n_head)
v = __split_heads(v, n_head)
ctx_multiheads = scaled_dot_product_attention(q, k, v, attn_bias, d_key, dropout_rate)
out = __combine_heads(ctx_multiheads)
# Project back to the model size.
proj_out = layers.fc(
input=out,
size=d_model,
num_flatten_dims=2,
param_attr=fluid.ParamAttr(name=name + "_output_fc.w_0", initializer=param_initializer),
bias_attr=name + "_output_fc.b_0",
)
return proj_out
def positionwise_feed_forward(x, d_inner_hid, d_hid, dropout_rate, hidden_act, param_initializer=None, name="ffn"):
"""
Position-wise Feed-Forward Networks.
This module consists of two linear transformations with a ReLU activation
in between, which is applied to each position separately and identically.
"""
hidden = layers.fc(
input=x,
size=d_inner_hid,
num_flatten_dims=2,
act=hidden_act,
param_attr=fluid.ParamAttr(name=name + "_fc_0.w_0", initializer=param_initializer),
bias_attr=name + "_fc_0.b_0",
)
if dropout_rate:
hidden = layers.dropout(
hidden, dropout_prob=dropout_rate, dropout_implementation="upscale_in_train", is_test=False
)
out = layers.fc(
input=hidden,
size=d_hid,
num_flatten_dims=2,
param_attr=fluid.ParamAttr(name=name + "_fc_1.w_0", initializer=param_initializer),
bias_attr=name + "_fc_1.b_0",
)
return out
def pre_post_process_layer(prev_out, out, process_cmd, dropout_rate=0.0, name=""):
"""
Add residual connection, layer normalization and dropout to the out tensor
optionally according to the value of process_cmd.
This will be used before or after multi-head attention and position-wise
feed-forward networks.
"""
for cmd in process_cmd:
if cmd == "a": # add residual connection
out = out + prev_out if prev_out else out
elif cmd == "n": # add layer normalization
out_dtype = out.dtype
if out_dtype == fluid.core.VarDesc.VarType.FP16:
out = layers.cast(x=out, dtype="float32")
out = layers.layer_norm(
out,
begin_norm_axis=len(out.shape) - 1,
param_attr=fluid.ParamAttr(
name=name + "_layer_norm_scale", initializer=fluid.initializer.Constant(1.0)
),
bias_attr=fluid.ParamAttr(name=name + "_layer_norm_bias", initializer=fluid.initializer.Constant(0.0)),
)
if out_dtype == fluid.core.VarDesc.VarType.FP16:
out = layers.cast(x=out, dtype="float16")
elif cmd == "d": # add dropout
if dropout_rate:
out = layers.dropout(
out, dropout_prob=dropout_rate, dropout_implementation="upscale_in_train", is_test=False
)
return out
pre_process_layer = partial(pre_post_process_layer, None)
post_process_layer = pre_post_process_layer
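# Illustrative: process_cmd is a mini-language over "a" (residual add),
# "n" (layer norm) and "d" (dropout); e.g. postprocess_cmd="dan" applies
# dropout, then the residual connection, then layer norm. pre_process_layer
# binds prev_out=None, which turns the "a" command into a no-op.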
def encoder_layer(
enc_input,
attn_bias,
n_head,
d_key,
d_value,
d_model,
d_inner_hid,
prepostprocess_dropout,
attention_dropout,
relu_dropout,
hidden_act,
preprocess_cmd="n",
postprocess_cmd="da",
param_initializer=None,
name="",
):
"""The encoder layers that can be stacked to form a deep encoder.
This module consists of a multi-head (self) attention sublayer followed by
a position-wise feed-forward network, with both components wrapped in
post_process_layer to add residual connection, layer normalization
and dropout.
"""
attn_output = multi_head_attention(
pre_process_layer(enc_input, preprocess_cmd, prepostprocess_dropout, name=name + "_pre_att"),
None,
None,
attn_bias,
d_key,
d_value,
d_model,
n_head,
attention_dropout,
param_initializer=param_initializer,
name=name + "_multi_head_att",
)
attn_output = post_process_layer(
enc_input, attn_output, postprocess_cmd, prepostprocess_dropout, name=name + "_post_att"
)
ffd_output = positionwise_feed_forward(
pre_process_layer(attn_output, preprocess_cmd, prepostprocess_dropout, name=name + "_pre_ffn"),
d_inner_hid,
d_model,
relu_dropout,
hidden_act,
param_initializer=param_initializer,
name=name + "_ffn",
)
return (
post_process_layer(attn_output, ffd_output, postprocess_cmd, prepostprocess_dropout, name=name + "_post_ffn"),
ffd_output,
)
def encoder(
enc_input,
attn_bias,
n_layer,
n_head,
d_key,
d_value,
d_model,
d_inner_hid,
prepostprocess_dropout,
attention_dropout,
relu_dropout,
hidden_act,
preprocess_cmd="n",
postprocess_cmd="da",
param_initializer=None,
model_name="",
name="",
):
"""
The encoder is composed of a stack of identical layers returned by calling
encoder_layer.
"""
checkpoints = []
for i in range(n_layer):
enc_output, cp = encoder_layer(
enc_input,
attn_bias,
n_head,
d_key,
d_value,
d_model,
d_inner_hid,
prepostprocess_dropout,
attention_dropout,
relu_dropout,
hidden_act,
preprocess_cmd,
postprocess_cmd,
param_initializer=param_initializer,
name=name + "_layer_" + str(i),
)
checkpoints.append(cp)
enc_input = enc_output
enc_output = pre_process_layer(
enc_output, preprocess_cmd, prepostprocess_dropout, name=model_name + "post_encoder"
)
return enc_output, checkpoints
# Copyright (c) 2019 PaddlePaddle Authors. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
"""Optimization and learning rate scheduling."""
import paddle.fluid as fluid
from paddle.fluid.incubate.fleet.collective import fleet
def linear_warmup_decay(learning_rate, warmup_steps, num_train_steps):
"""Applies linear warmup of learning rate from 0 and decay to 0."""
with fluid.default_main_program()._lr_schedule_guard():
lr = fluid.layers.tensor.create_global_var(
shape=[1], value=0.0, dtype="float32", persistable=True, name="scheduled_learning_rate"
)
global_step = fluid.layers.learning_rate_scheduler._decay_step_counter()
with fluid.layers.control_flow.Switch() as switch:
with switch.case(global_step < warmup_steps):
warmup_lr = learning_rate * (global_step / warmup_steps)
fluid.layers.tensor.assign(warmup_lr, lr)
with switch.default():
decayed_lr = fluid.layers.learning_rate_scheduler.polynomial_decay(
learning_rate=learning_rate,
decay_steps=num_train_steps,
end_learning_rate=0.0,
power=1.0,
cycle=False,
)
fluid.layers.tensor.assign(decayed_lr, lr)
return lr
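# The resulting schedule, as a sketch:
#   step <  warmup_steps: lr = learning_rate * step / warmup_steps
#   step >= warmup_steps: lr decays linearly (polynomial decay, power=1.0)
#                         from learning_rate to 0 over num_train_steps.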
def optimization(
loss,
warmup_steps,
num_train_steps,
learning_rate,
train_program,
startup_prog,
weight_decay,
scheduler="linear_warmup_decay",
use_dynamic_loss_scaling=False,
incr_every_n_steps=1000,
decr_every_n_nan_or_inf=2,
incr_ratio=2.0,
decr_ratio=0.8,
dist_strategy=None,
use_lamb=False,
):
if warmup_steps > 0:
if scheduler == "noam_decay":
scheduled_lr = fluid.layers.learning_rate_scheduler.noam_decay(
1 / (warmup_steps * (learning_rate**2)), warmup_steps
)
elif scheduler == "linear_warmup_decay":
scheduled_lr = linear_warmup_decay(learning_rate, warmup_steps, num_train_steps)
else:
raise ValueError("Unknown learning rate scheduler, should be " "'noam_decay' or 'linear_warmup_decay'")
if use_lamb:
optimizer = fluid.optimizer.LambOptimizer(learning_rate=scheduled_lr)
else:
optimizer = fluid.optimizer.Adam(learning_rate=scheduled_lr)
else:
scheduled_lr = fluid.layers.create_global_var(
name=fluid.unique_name.generate("learning_rate"),
shape=[1],
value=learning_rate,
dtype="float32",
persistable=True,
)
if use_lamb:
optimizer = fluid.optimizer.LambOptimizer(learning_rate=scheduled_lr)
else:
optimizer = fluid.optimizer.Adam(learning_rate=scheduled_lr)
optimizer._learning_rate_map[fluid.default_main_program()] = scheduled_lr
fluid.clip.set_gradient_clip(clip=fluid.clip.GradientClipByGlobalNorm(clip_norm=1.0))
def exclude_from_weight_decay(name):
if name.find("layer_norm") > -1:
return True
bias_suffix = ["_bias", "_b", ".b_0"]
for suffix in bias_suffix:
if name.endswith(suffix):
return True
return False
param_list = dict()
for param in train_program.global_block().all_parameters():
param_list[param.name] = param * 1.0
param_list[param.name].stop_gradient = True
if dist_strategy is not None:
# use fleet api
optimizer = fleet.distributed_optimizer(optimizer, strategy=dist_strategy)
_, param_grads = optimizer.minimize(loss)
if weight_decay > 0:
for param, grad in param_grads:
if exclude_from_weight_decay(param.name):
continue
with param.block.program._optimized_guard([param, grad]), fluid.framework.name_scope("weight_decay"):
updated_param = param - param_list[param.name] * weight_decay * scheduled_lr
fluid.layers.assign(output=param, input=updated_param)
return scheduled_lr
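# Note: weight decay above is applied manually after optimizer.minimize as
#   param -= param_snapshot * weight_decay * scheduled_lr
# where param_snapshot is the pre-update value captured in param_list, i.e.
# decoupled (AdamW-style) decay rather than L2 loss regularization.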
# Copyright (c) 2019 PaddlePaddle Authors. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
import json
import logging
import sys
from collections import namedtuple
from io import open
import numpy as np
import six
import tokenization
from batching import pad_batch_data
log = logging.getLogger(__name__)
if six.PY3:
import io
sys.stdout = io.TextIOWrapper(sys.stdout.buffer, encoding="utf-8")
sys.stderr = io.TextIOWrapper(sys.stderr.buffer, encoding="utf-8")
def csv_reader(fd, delimiter="\t", trainer_id=0, trainer_num=1):
def gen():
for i, line in enumerate(fd):
if i % trainer_num == trainer_id:
slots = line.rstrip("\n").split(delimiter)
if len(slots) == 1:
yield slots,
else:
yield slots
return gen()
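# Illustrative sharding: with trainer_num=2, trainer 0 yields lines 0, 2, 4...
# and trainer 1 yields lines 1, 3, 5..., so workers read disjoint slices of
# the same TSV file.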
class BaseReader(object):
def __init__(
self,
vocab_path,
label_map_config=None,
max_seq_len=512,
total_num=0,
do_lower_case=True,
in_tokens=False,
is_inference=False,
random_seed=None,
tokenizer="FullTokenizer",
for_cn=True,
task_id=0,
):
self.max_seq_len = max_seq_len
self.tokenizer = tokenization.FullTokenizer(vocab_file=vocab_path, do_lower_case=do_lower_case)
self.vocab = self.tokenizer.vocab
self.pad_id = self.vocab["[PAD]"]
self.cls_id = self.vocab["[CLS]"]
self.sep_id = self.vocab["[SEP]"]
self.in_tokens = in_tokens
self.is_inference = is_inference
self.for_cn = for_cn
self.task_id = task_id
np.random.seed(random_seed)
self.current_example = 0
self.current_epoch = 0
self.num_examples = 0
self.total_num = total_num
if label_map_config:
with open(label_map_config, encoding="utf8") as f:
self.label_map = json.load(f)
else:
self.label_map = None
def get_train_progress(self):
"""Gets progress for training phase."""
return self.current_example, self.current_epoch
def _read_tsv(self, input_file, quotechar=None):
"""Reads a tab separated value file."""
with open(input_file, "r", encoding="utf8") as f:
reader = csv_reader(f)
headers = next(reader)
Example = namedtuple("Example", headers)
examples = []
for line in reader:
example = Example(*line)
examples.append(example)
return examples
def _truncate_seq_pair(self, tokens_a, tokens_b, max_length):
"""Truncates a sequence pair in place to the maximum length."""
# This is a simple heuristic which will always truncate the longer sequence
# one token at a time. This makes more sense than truncating an equal percent
# of tokens from each, since if one sequence is very short then each token
# that's truncated likely contains more information than a longer sequence.
while True:
total_length = len(tokens_a) + len(tokens_b)
if total_length <= max_length:
break
if len(tokens_a) > len(tokens_b):
tokens_a.pop()
else:
tokens_b.pop()
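# Illustrative: with max_length=4, tokens_a=["a","b","c"] and
# tokens_b=["x","y","z"] become ["a","b"] / ["x","y"]: each iteration pops
# one token from whichever sequence is currently longer.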
def _convert_example_to_record(self, example, max_seq_length, tokenizer):
"""Converts a single `Example` into a single `Record`."""
query = tokenization.convert_to_unicode(example.query)
tokens_a = tokenizer.tokenize(query)
tokens_b = None
title = tokenization.convert_to_unicode(example.title)
tokens_b = tokenizer.tokenize(title)
para = tokenization.convert_to_unicode(example.para)
tokens_para = tokenizer.tokenize(para)
tokens_b.extend(tokens_para)
self._truncate_seq_pair(tokens_a, tokens_b, max_seq_length - 3)
# The convention in BERT/ERNIE is:
# (a) For sequence pairs:
# tokens: [CLS] is this jack ##son ##ville ? [SEP] no it is not . [SEP]
# type_ids: 0 0 0 0 0 0 0 0 1 1 1 1 1 1
# (b) For single sequences:
# tokens: [CLS] the dog is hairy . [SEP]
# type_ids: 0 0 0 0 0 0 0
#
# Where "type_ids" are used to indicate whether this is the first
# sequence or the second sequence. The embedding vectors for `type=0` and
# `type=1` were learned during pre-training and are added to the wordpiece
# embedding vector (and position vector). This is not *strictly* necessary
# since the [SEP] token unambiguously separates the sequences, but it makes
# it easier for the model to learn the concept of sequences.
#
# For classification tasks, the first vector (corresponding to [CLS]) is
# used as the "sentence vector". Note that this only makes sense because
# the entire model is fine-tuned.
tokens = []
text_type_ids = []
tokens.append("[CLS]")
text_type_ids.append(0)
for token in tokens_a:
tokens.append(token)
text_type_ids.append(0)
tokens.append("[SEP]")
text_type_ids.append(0)
if tokens_b:
for token in tokens_b:
tokens.append(token)
text_type_ids.append(1)
tokens.append("[SEP]")
text_type_ids.append(1)
token_ids = tokenizer.convert_tokens_to_ids(tokens)
position_ids = list(range(len(token_ids)))
if self.is_inference:
Record = namedtuple("Record", ["token_ids", "text_type_ids", "position_ids"])
record = Record(token_ids=token_ids, text_type_ids=text_type_ids, position_ids=position_ids)
else:
if self.label_map:
label_id = self.label_map[example.label]
else:
label_id = example.label
Record = namedtuple("Record", ["token_ids", "text_type_ids", "position_ids", "label_id", "qid"])
qid = None
if "qid" in example._fields:
qid = example.qid
record = Record(
token_ids=token_ids, text_type_ids=text_type_ids, position_ids=position_ids, label_id=label_id, qid=qid
)
return record
def _prepare_batch_data(self, examples, batch_size, phase=None):
"""generate batch records"""
batch_records, max_len = [], 0
for index, example in enumerate(examples):
if phase == "train":
self.current_example = index
record = self._convert_example_to_record(example, self.max_seq_len, self.tokenizer)
max_len = max(max_len, len(record.token_ids))
if self.in_tokens:
to_append = (len(batch_records) + 1) * max_len <= batch_size
else:
to_append = len(batch_records) < batch_size
if to_append:
batch_records.append(record)
else:
yield self._pad_batch_records(batch_records)
batch_records, max_len = [record], len(record.token_ids)
if batch_records:
yield self._pad_batch_records(batch_records)
def get_num_examples(self, input_file):
# examples = self._read_tsv(input_file)
# return len(examples)
return self.num_examples
def data_generator(
self, input_file, batch_size, epoch, dev_count=1, trainer_id=0, trainer_num=1, shuffle=True, phase=None
):
if phase == "train":
# examples = examples[trainer_id: (len(examples) //trainer_num) * trainer_num : trainer_num]
self.num_examples_per_node = self.total_num // trainer_num
self.num_examples = self.num_examples_per_node * trainer_num
examples = self._read_tsv(
input_file, trainer_id=trainer_id, trainer_num=trainer_num, num_examples=self.num_examples_per_node
)
log.info("apply sharding %d/%d" % (trainer_id, trainer_num))
else:
examples = self._read_tsv(input_file)
def wrapper():
all_dev_batches = []
for epoch_index in range(epoch):
if phase == "train":
self.current_example = 0
self.current_epoch = epoch_index
if shuffle:
np.random.shuffle(examples)
for batch_data in self._prepare_batch_data(examples, batch_size, phase=phase):
if len(all_dev_batches) < dev_count:
all_dev_batches.append(batch_data)
if len(all_dev_batches) == dev_count:
for batch in all_dev_batches:
yield batch
all_dev_batches = []
def f():
try:
for i in wrapper():
yield i
except Exception:
import traceback
traceback.print_exc()
return f
class ClassifyReader(BaseReader):
def _read_tsv(self, input_file, quotechar=None, trainer_id=0, trainer_num=1, num_examples=0):
"""Reads a tab separated value file."""
with open(input_file, "r", encoding="utf8") as f:
reader = csv_reader(f, trainer_id=trainer_id, trainer_num=trainer_num)
# headers = next(reader)
headers = "query\ttitle\tpara\tlabel".split("\t")
text_indices = [index for index, h in enumerate(headers) if h != "label"]
Example = namedtuple("Example", headers)
examples = []
for cnt, line in enumerate(reader):
if num_examples != 0 and cnt == num_examples:
break
for index, text in enumerate(line):
if index in text_indices:
if self.for_cn:
line[index] = text.replace(" ", "")
else:
line[index] = text
example = Example(*line)
examples.append(example)
return examples
def _pad_batch_records(self, batch_records):
batch_token_ids = [record.token_ids for record in batch_records]
batch_text_type_ids = [record.text_type_ids for record in batch_records]
batch_position_ids = [record.position_ids for record in batch_records]
if not self.is_inference:
batch_labels = [record.label_id for record in batch_records]
batch_labels = np.array(batch_labels).astype("int64").reshape([-1, 1])
if batch_records[0].qid:
batch_qids = [record.qid for record in batch_records]
batch_qids = np.array(batch_qids).astype("int64").reshape([-1, 1])
else:
batch_qids = np.array([]).astype("int64").reshape([-1, 1])
# padding
padded_token_ids, input_mask = pad_batch_data(batch_token_ids, pad_idx=self.pad_id, return_input_mask=True)
padded_text_type_ids = pad_batch_data(batch_text_type_ids, pad_idx=self.pad_id)
padded_position_ids = pad_batch_data(batch_position_ids, pad_idx=self.pad_id)
padded_task_ids = np.ones_like(padded_token_ids, dtype="int64") * self.task_id
return_list = [padded_token_ids, padded_text_type_ids, padded_position_ids, padded_task_ids, input_mask]
if not self.is_inference:
return_list += [batch_labels, batch_qids]
return return_list
# coding=utf-8
# Copyright 2018 The Google AI Language Team Authors.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
"""Tokenization classes."""
from __future__ import absolute_import
from __future__ import division
from __future__ import print_function
from __future__ import unicode_literals
import collections
import unicodedata
from io import open
def convert_to_unicode(text):
"""Converts `text` to Unicode (if it's not already), assuming utf-8 input."""
if isinstance(text, str):
return text
elif isinstance(text, bytes):
return text.decode("utf-8", "ignore")
else:
raise ValueError("Unsupported string type: %s" % (type(text)))
def printable_text(text):
"""Returns text encoded in a way suitable for print or `tf.logging`."""
if isinstance(text, str):
return text
elif isinstance(text, bytes):
return text.decode("utf-8", "ignore")
else:
raise ValueError("Unsupported string type: %s" % (type(text)))
def load_vocab(vocab_file):
"""Loads a vocabulary file into a dictionary."""
vocab = collections.OrderedDict()
with open(vocab_file, encoding="utf8") as fin:
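# Each vocab line is either "token\tid" or a bare token; a bare token gets its line number as its id.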
for num, line in enumerate(fin):
items = convert_to_unicode(line.strip()).split("\t")
if len(items) > 2:
break
token = items[0]
index = items[1] if len(items) == 2 else num
token = token.strip()
vocab[token] = int(index)
return vocab
def convert_by_vocab(vocab, items):
"""Converts a sequence of [tokens|ids] using the vocab."""
output = []
for item in items:
output.append(vocab[item])
return output
def convert_tokens_to_ids(vocab, tokens):
return convert_by_vocab(vocab, tokens)
def convert_ids_to_tokens(inv_vocab, ids):
return convert_by_vocab(inv_vocab, ids)
def whitespace_tokenize(text):
"""Runs basic whitespace cleaning and splitting on a piece of text."""
text = text.strip()
if not text:
return []
tokens = text.split()
return tokens
class FullTokenizer(object):
"""Runs end-to-end tokenization."""
def __init__(self, vocab_file, do_lower_case=True):
self.vocab = load_vocab(vocab_file)
self.inv_vocab = {v: k for k, v in self.vocab.items()}
self.basic_tokenizer = BasicTokenizer(do_lower_case=do_lower_case)
self.wordpiece_tokenizer = WordpieceTokenizer(vocab=self.vocab)
def tokenize(self, text):
split_tokens = []
for token in self.basic_tokenizer.tokenize(text):
for sub_token in self.wordpiece_tokenizer.tokenize(token):
split_tokens.append(sub_token)
return split_tokens
def convert_tokens_to_ids(self, tokens):
return convert_by_vocab(self.vocab, tokens)
def convert_ids_to_tokens(self, ids):
return convert_by_vocab(self.inv_vocab, ids)
class CharTokenizer(object):
"""Runs end-to-end tokenization."""
def __init__(self, vocab_file, do_lower_case=True):
self.vocab = load_vocab(vocab_file)
self.inv_vocab = {v: k for k, v in self.vocab.items()}
self.wordpiece_tokenizer = WordpieceTokenizer(vocab=self.vocab)
def tokenize(self, text):
split_tokens = []
for token in text.lower().split(" "):
for sub_token in self.wordpiece_tokenizer.tokenize(token):
split_tokens.append(sub_token)
return split_tokens
def convert_tokens_to_ids(self, tokens):
return convert_by_vocab(self.vocab, tokens)
def convert_ids_to_tokens(self, ids):
return convert_by_vocab(self.inv_vocab, ids)
class BasicTokenizer(object):
"""Runs basic tokenization (punctuation splitting, lower casing, etc.)."""
def __init__(self, do_lower_case=True):
"""Constructs a BasicTokenizer.
Args:
do_lower_case: Whether to lower case the input.
"""
self.do_lower_case = do_lower_case
def tokenize(self, text):
"""Tokenizes a piece of text."""
text = convert_to_unicode(text)
text = self._clean_text(text)
# This was added on November 1st, 2018 for the multilingual and Chinese
# models. This is also applied to the English models now, but it doesn't
# matter since the English models were not trained on any Chinese data
# and generally don't have any Chinese data in them (there are Chinese
# characters in the vocabulary because Wikipedia does have some Chinese
# words in the English Wikipedia.).
text = self._tokenize_chinese_chars(text)
orig_tokens = whitespace_tokenize(text)
split_tokens = []
for token in orig_tokens:
if self.do_lower_case:
token = token.lower()
token = self._run_strip_accents(token)
split_tokens.extend(self._run_split_on_punc(token))
output_tokens = whitespace_tokenize(" ".join(split_tokens))
return output_tokens
def _run_strip_accents(self, text):
"""Strips accents from a piece of text."""
text = unicodedata.normalize("NFD", text)
output = []
for char in text:
cat = unicodedata.category(char)
if cat == "Mn":
continue
output.append(char)
return "".join(output)
def _run_split_on_punc(self, text):
"""Splits punctuation on a piece of text."""
chars = list(text)
i = 0
start_new_word = True
output = []
while i < len(chars):
char = chars[i]
if _is_punctuation(char):
output.append([char])
start_new_word = True
else:
if start_new_word:
output.append([])
start_new_word = False
output[-1].append(char)
i += 1
return ["".join(x) for x in output]
def _tokenize_chinese_chars(self, text):
"""Adds whitespace around any CJK character."""
output = []
for char in text:
cp = ord(char)
if self._is_chinese_char(cp):
output.append(" ")
output.append(char)
output.append(" ")
else:
output.append(char)
return "".join(output)
def _is_chinese_char(self, cp):
"""Checks whether CP is the codepoint of a CJK character."""
# This defines a "chinese character" as anything in the CJK Unicode block:
# https://en.wikipedia.org/wiki/CJK_Unified_Ideographs_(Unicode_block)
#
# Note that the CJK Unicode block is NOT all Japanese and Korean characters,
# despite its name. The modern Korean Hangul alphabet is a different block,
# as is Japanese Hiragana and Katakana. Those alphabets are used to write
# space-separated words, so they are not treated specially and handled
# like all of the other languages.
if (
(cp >= 0x4E00 and cp <= 0x9FFF)
or (cp >= 0x3400 and cp <= 0x4DBF) #
or (cp >= 0x20000 and cp <= 0x2A6DF) #
or (cp >= 0x2A700 and cp <= 0x2B73F) #
or (cp >= 0x2B740 and cp <= 0x2B81F) #
or (cp >= 0x2B820 and cp <= 0x2CEAF) #
or (cp >= 0xF900 and cp <= 0xFAFF)
or (cp >= 0x2F800 and cp <= 0x2FA1F) #
): #
return True
return False
def _clean_text(self, text):
"""Performs invalid character removal and whitespace cleanup on text."""
output = []
for char in text:
cp = ord(char)
if cp == 0 or cp == 0xFFFD or _is_control(char):
continue
if _is_whitespace(char):
output.append(" ")
else:
output.append(char)
return "".join(output)
class WordpieceTokenizer(object):
"""Runs WordPiece tokenization."""
def __init__(self, vocab, unk_token="[UNK]", max_input_chars_per_word=100):
self.vocab = vocab
self.unk_token = unk_token
self.max_input_chars_per_word = max_input_chars_per_word
def tokenize(self, text):
"""Tokenizes a piece of text into its word pieces.
This uses a greedy longest-match-first algorithm to perform tokenization
using the given vocabulary.
For example:
input = "unaffable"
output = ["un", "##aff", "##able"]
Args:
text: A single token or whitespace separated tokens. This should have
already been passed through `BasicTokenizer`.
Returns:
A list of wordpiece tokens.
"""
text = convert_to_unicode(text)
output_tokens = []
for token in whitespace_tokenize(text):
chars = list(token)
if len(chars) > self.max_input_chars_per_word:
output_tokens.append(self.unk_token)
continue
is_bad = False
start = 0
sub_tokens = []
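# Greedy longest-match-first: try the longest remaining substring and back off one character at a time.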
while start < len(chars):
end = len(chars)
cur_substr = None
while start < end:
substr = "".join(chars[start:end])
if start > 0:
substr = "##" + substr
if substr in self.vocab:
cur_substr = substr
break
end -= 1
if cur_substr is None:
is_bad = True
break
sub_tokens.append(cur_substr)
start = end
if is_bad:
output_tokens.append(self.unk_token)
else:
output_tokens.extend(sub_tokens)
return output_tokens
def _is_whitespace(char):
"""Checks whether `chars` is a whitespace character."""
# \t, \n, and \r are technically control characters but we treat them
# as whitespace since they are generally considered as such.
if char == " " or char == "\t" or char == "\n" or char == "\r":
return True
cat = unicodedata.category(char)
if cat == "Zs":
return True
return False
def _is_control(char):
"""Checks whether `chars` is a control character."""
# These are technically control characters but we count them as whitespace
# characters.
if char == "\t" or char == "\n" or char == "\r":
return False
cat = unicodedata.category(char)
if cat.startswith("C"):
return True
return False
def _is_punctuation(char):
"""Checks whether `chars` is a punctuation character."""
cp = ord(char)
# We treat all non-letter/number ASCII as punctuation.
# Characters such as "^", "$", and "`" are not in the Unicode
# Punctuation class but we treat them as punctuation anyways, for
# consistency.
if (cp >= 33 and cp <= 47) or (cp >= 58 and cp <= 64) or (cp >= 91 and cp <= 96) or (cp >= 123 and cp <= 126):
return True
cat = unicodedata.category(char)
if cat.startswith("P"):
return True
return False
def tokenize_chinese_chars(text):
"""Adds whitespace around any CJK character."""
def _is_chinese_char(cp):
"""Checks whether CP is the codepoint of a CJK character."""
# This defines a "chinese character" as anything in the CJK Unicode block:
# https://en.wikipedia.org/wiki/CJK_Unified_Ideographs_(Unicode_block)
#
# Note that the CJK Unicode block is NOT all Japanese and Korean characters,
# despite its name. The modern Korean Hangul alphabet is a different block,
# as is Japanese Hiragana and Katakana. Those alphabets are used to write
# space-separated words, so they are not treated specially and handled
# like all of the other languages.
if (
(cp >= 0x4E00 and cp <= 0x9FFF)
or (cp >= 0x3400 and cp <= 0x4DBF) #
or (cp >= 0x20000 and cp <= 0x2A6DF) #
or (cp >= 0x2A700 and cp <= 0x2B73F) #
or (cp >= 0x2B740 and cp <= 0x2B81F) #
or (cp >= 0x2B820 and cp <= 0x2CEAF) #
or (cp >= 0xF900 and cp <= 0xFAFF)
or (cp >= 0x2F800 and cp <= 0x2FA1F) #
): #
return True
return False
def _is_whitespace(c):
if c == " " or c == "\t" or c == "\r" or c == "\n" or ord(c) == 0x202F:
return True
return False
output = []
buff = ""
for char in text:
cp = ord(char)
if _is_chinese_char(cp) or _is_whitespace(char):
if buff != "":
output.append(buff)
buff = ""
output.append(char)
else:
buff += char
if buff != "":
output.append(buff)
return output
# Copyright (c) 2019 PaddlePaddle Authors. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
"""Finetuning on classification tasks."""
from __future__ import absolute_import, division, print_function, unicode_literals
import logging
import multiprocessing
import os
import time
import warnings
# NOTE(paddle-dev): All of these flags should be
# set before `import paddle`. Otherwise, it would
# not take any effect.
os.environ["FLAGS_eager_delete_tensor_gb"] = "0" # enable gc
import paddle # noqa: E402
import paddle.fluid as fluid # noqa: E402
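# This script uses the static-graph API; Paddle 2.x starts in dynamic mode, so switch explicitly when the API is available.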
if hasattr(paddle, "enable_static"):
paddle.enable_static()
import paddle.fluid.incubate.fleet.base.role_maker as role_maker # noqa: E402
import reader_ce as reader_ce # noqa: E402
from cross_encoder import create_model, evaluate, predict # noqa: E402
from finetune_args import parser # noqa: E402
from model.ernie import ErnieConfig # noqa: E402
from optimization import optimization # noqa: E402
from paddle.fluid.incubate.fleet.collective import ( # noqa: E402
DistributedStrategy,
fleet,
)
from src.utils.args import check_cuda, prepare_logger, print_arguments # noqa: E402
from src.utils.init import init_checkpoint, init_pretraining_params # noqa: E402
warnings.filterwarnings("ignore")
args = parser.parse_args()
log = logging.getLogger()
def main(args):
ernie_config = ErnieConfig(args.ernie_config_path)
ernie_config.print_config()
if args.use_cuda:
dev_list = fluid.cuda_places()
place = dev_list[0]
dev_count = len(dev_list)
else:
place = fluid.CPUPlace()
dev_count = int(os.environ.get("CPU_NUM", multiprocessing.cpu_count()))
exe = fluid.Executor(place)
reader = reader_ce.ClassifyReader(
vocab_path=args.vocab_path,
label_map_config=args.label_map_config,
max_seq_len=args.max_seq_len,
total_num=args.train_data_size,
do_lower_case=args.do_lower_case,
in_tokens=args.in_tokens,
random_seed=args.random_seed,
tokenizer=args.tokenizer,
for_cn=args.for_cn,
task_id=args.task_id,
)
if not (args.do_train or args.do_val or args.do_test):
raise ValueError("For args `do_train`, `do_val` and `do_test`, at " "least one of them must be True.")
if args.do_test:
assert args.test_save is not None
startup_prog = fluid.Program()
if args.random_seed is not None:
startup_prog.random_seed = args.random_seed
if args.predict_batch_size is None:
args.predict_batch_size = args.batch_size
if args.do_train:
role = role_maker.PaddleCloudRoleMaker(is_collective=True)
fleet.init(role)
dev_count = fleet.worker_num()
train_data_generator = reader.data_generator(
input_file=args.train_set,
batch_size=args.batch_size,
epoch=args.epoch,
dev_count=1,
trainer_id=fleet.worker_index(),
trainer_num=fleet.worker_num(),
shuffle=True,
phase="train",
)
num_train_examples = reader.get_num_examples(args.train_set)
if args.in_tokens:
max_train_steps = args.epoch * num_train_examples // (args.batch_size // args.max_seq_len) // dev_count
else:
max_train_steps = args.epoch * num_train_examples // args.batch_size // dev_count
warmup_steps = int(max_train_steps * args.warmup_proportion)
log.info("Device count: %d" % dev_count)
log.info("Num train examples: %d" % num_train_examples)
log.info("Max train steps: %d" % max_train_steps)
log.info("Num warmup steps: %d" % warmup_steps)
train_program = fluid.Program()
# use fleet api
exec_strategy = fluid.ExecutionStrategy()
if args.use_fast_executor:
exec_strategy.use_experimental_executor = True
exec_strategy.num_threads = dev_count
if args.is_distributed:
exec_strategy.num_threads = 3
exec_strategy.num_iteration_per_drop_scope = args.num_iteration_per_drop_scope
dist_strategy = DistributedStrategy()
dist_strategy.exec_strategy = exec_strategy
dist_strategy.nccl_comm_num = 1
if args.is_distributed:
dist_strategy.nccl_comm_num = 2
dist_strategy.use_hierarchical_allreduce = True
if args.use_mix_precision:
dist_strategy.use_amp = True
with fluid.program_guard(train_program, startup_prog):
with fluid.unique_name.guard():
train_pyreader, graph_vars = create_model(
args, pyreader_name="train_reader", ernie_config=ernie_config
)
scheduled_lr = optimization(
loss=graph_vars["loss"],
warmup_steps=warmup_steps,
num_train_steps=max_train_steps,
learning_rate=args.learning_rate,
train_program=train_program,
startup_prog=startup_prog,
weight_decay=args.weight_decay,
scheduler=args.lr_scheduler,
use_dynamic_loss_scaling=args.use_dynamic_loss_scaling,
incr_every_n_steps=args.incr_every_n_steps,
decr_every_n_nan_or_inf=args.decr_every_n_nan_or_inf,
incr_ratio=args.incr_ratio,
decr_ratio=args.decr_ratio,
dist_strategy=dist_strategy,
)
if args.verbose:
if args.in_tokens:
lower_mem, upper_mem, unit = fluid.contrib.memory_usage(
program=train_program, batch_size=args.batch_size // args.max_seq_len
)
else:
lower_mem, upper_mem, unit = fluid.contrib.memory_usage(
program=train_program, batch_size=args.batch_size
)
log.info("Theoretical memory usage in training: %.3f - %.3f %s" % (lower_mem, upper_mem, unit))
if args.do_val or args.do_test:
test_prog = fluid.Program()
with fluid.program_guard(test_prog, startup_prog):
with fluid.unique_name.guard():
test_pyreader, graph_vars = create_model(
args, pyreader_name="test_reader", ernie_config=ernie_config, is_prediction=True
)
test_prog = test_prog.clone(for_test=True)
train_program = fleet.main_program
exe = fluid.Executor(place)
exe.run(startup_prog)
if args.do_train:
if args.init_checkpoint and args.init_pretraining_params:
log.warning(
"WARNING: args 'init_checkpoint' and 'init_pretraining_params' "
"both are set! Only arg 'init_checkpoint' is made valid."
)
if args.init_checkpoint:
init_checkpoint(exe, args.init_checkpoint, main_program=startup_prog)
elif args.init_pretraining_params:
init_pretraining_params(exe, args.init_pretraining_params, main_program=startup_prog)
elif args.do_val or args.do_test:
if not args.init_checkpoint:
raise ValueError("args 'init_checkpoint' should be set if" "only doing validation or testing!")
init_checkpoint(exe, args.init_checkpoint, main_program=startup_prog)
if args.do_train:
train_exe = exe
train_pyreader.decorate_tensor_provider(train_data_generator)
else:
train_exe = None
test_exe = exe
current_epoch = 0
steps = 0
if args.do_train:
train_pyreader.start()
if warmup_steps > 0:
graph_vars["learning_rate"] = scheduled_lr
ce_info = []
time_begin = time.time()
last_epoch = 0
while True:
try:
steps += 1
if fleet.worker_index() != 0:
train_exe.run(fetch_list=[], program=train_program)
continue
if steps % args.skip_steps != 0:
train_exe.run(fetch_list=[], program=train_program)
else:
outputs = evaluate(
train_exe, train_program, train_pyreader, graph_vars, "train", metric=args.metric
)
if args.verbose:
verbose = "train pyreader queue size: %d, " % train_pyreader.queue.size()
verbose += "learning rate: %f" % (
outputs["learning_rate"] if warmup_steps > 0 else args.learning_rate
)
log.info(verbose)
current_example, current_epoch = reader.get_train_progress()
time_end = time.time()
used_time = time_end - time_begin
log.info(
"epoch: %d, progress: %d/%d, step: %d, ave loss: %f, "
"ave acc: %f, speed: %f steps/s"
% (
current_epoch,
current_example * dev_count,
num_train_examples,
steps,
outputs["loss"],
outputs["accuracy"],
args.skip_steps / used_time,
)
)
ce_info.append([outputs["loss"], outputs["accuracy"], used_time])
time_begin = time.time()
if steps % args.save_steps == 0:
save_path = os.path.join(args.checkpoints, "step_" + str(steps))
fluid.io.save_persistables(exe, save_path, fleet._origin_program)
if steps % args.validation_steps == 0:
# evaluate dev set
if args.do_val:
evaluate_wrapper(args, reader, exe, test_prog, test_pyreader, graph_vars, current_epoch, steps)
if args.do_test:
predict_wrapper(args, reader, exe, test_prog, test_pyreader, graph_vars, current_epoch, steps)
if last_epoch != current_epoch:
last_epoch = current_epoch
except fluid.core.EOFException:
save_path = os.path.join(args.checkpoints, "step_" + str(steps))
fluid.io.save_persistables(exe, save_path, fleet._origin_program)
train_pyreader.reset()
break
# final eval on dev set
if args.do_val:
evaluate_wrapper(args, reader, exe, test_prog, test_pyreader, graph_vars, current_epoch, steps)
# final eval on test set
if args.do_test:
predict_wrapper(args, reader, exe, test_prog, test_pyreader, graph_vars)
# final eval on diagnostic, hack for glue-ax
if args.diagnostic:
test_pyreader.decorate_tensor_provider(
reader.data_generator(args.diagnostic, batch_size=args.batch_size, epoch=1, dev_count=1, shuffle=False)
)
log.info("Final diagnostic")
qids, preds, probs = predict(test_exe, test_prog, test_pyreader, graph_vars)
assert len(qids) == len(preds), "{} v.s. {}".format(len(qids), len(preds))
with open(args.diagnostic_save, "w") as f:
for qid, s, p in zip(qids, preds, probs):
f.write("{}\t{}\t{}\n".format(qid, s, p))
log.info("Done final diagnostic, saving to {}".format(args.diagnostic_save))
def evaluate_wrapper(args, reader, exe, test_prog, test_pyreader, graph_vars, epoch, steps):
# evaluate dev set
for ds in args.dev_set.split(","):
test_pyreader.decorate_tensor_provider(
reader.data_generator(ds, batch_size=args.predict_batch_size, epoch=1, dev_count=1, shuffle=False)
)
log.info("validation result of dataset {}:".format(ds))
evaluate_info = evaluate(exe, test_prog, test_pyreader, graph_vars, "dev", metric=args.metric)
log.info(evaluate_info + ", file: {}, epoch: {}, steps: {}".format(ds, epoch, steps))
def predict_wrapper(args, reader, exe, test_prog, test_pyreader, graph_vars, epoch=None, steps=None):
test_sets = args.test_set.split(",")
save_dirs = args.test_save.split(",")
assert len(test_sets) == len(save_dirs)
for test_f, save_f in zip(test_sets, save_dirs):
test_pyreader.decorate_tensor_provider(
reader.data_generator(test_f, batch_size=args.predict_batch_size, epoch=1, dev_count=1, shuffle=False)
)
if epoch is not None or steps is not None:
save_path = save_f + "." + str(epoch) + "." + str(steps)
else:
save_path = save_f
log.info("testing {}, save to {}".format(test_f, save_path))
qids, preds, probs = predict(exe, test_prog, test_pyreader, graph_vars)
save_dir = os.path.dirname(save_path)
if not os.path.exists(save_dir):
os.makedirs(save_dir)
else:
log.warning("save dir exists: %s, will skip saving" % save_dir)
with open(save_path, "w") as f:
for p in probs:
f.write("{}\n".format(p[1]))
if __name__ == "__main__":
prepare_logger(log)
print_arguments(args)
check_cuda(args.use_cuda)
main(args)
# Copyright (c) 2019 PaddlePaddle Authors. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
"""Arguments for configuration."""
from __future__ import absolute_import, division, print_function, unicode_literals
import logging
import os
import sys
import paddle.fluid as fluid
import six
from paddlenlp.trainer.argparser import strtobool
log = logging.getLogger(__name__)
def prepare_logger(logger, debug=False, save_to_file=None):
formatter = logging.Formatter(fmt="[%(levelname)s] %(asctime)s [%(filename)12s:%(lineno)5d]:\t%(message)s")
console_hdl = logging.StreamHandler()
console_hdl.setFormatter(formatter)
logger.addHandler(console_hdl)
if save_to_file is not None and not os.path.exists(save_to_file):
file_hdl = logging.FileHandler(save_to_file)
file_hdl.setFormatter(formatter)
logger.addHandler(file_hdl)
logger.setLevel(logging.DEBUG)
logger.propagate = False
class ArgumentGroup(object):
def __init__(self, parser, title, des):
self._group = parser.add_argument_group(title=title, description=des)
def add_arg(self, name, type, default, help, positional_arg=False, **kwargs):
prefix = "" if positional_arg else "--"
type = strtobool if type == bool else type
self._group.add_argument(
prefix + name, default=default, type=type, help=help + " Default: %(default)s.", **kwargs
)
def print_arguments(args):
log.info("----------- Configuration Arguments -----------")
for arg, value in sorted(six.iteritems(vars(args))):
log.info("%s: %s" % (arg, value))
log.info("------------------------------------------------")
def check_cuda(
use_cuda,
err="\nYou can not set use_cuda = True in the model because you are using paddlepaddle-cpu.\n \
Please: 1. Install paddlepaddle-gpu to run your models on GPU or 2. Set use_cuda = False to run models on CPU.\n",
):
try:
if use_cuda is True and fluid.is_compiled_with_cuda() is False:
log.error(err)
sys.exit(1)
except Exception:
pass
# Copyright (c) 2019 PaddlePaddle Authors. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
import logging
import os
import paddle.fluid as fluid
log = logging.getLogger(__name__)
def init_checkpoint(exe, init_checkpoint_path, main_program):
assert os.path.exists(init_checkpoint_path), "[%s] cannot be found." % init_checkpoint_path
def existed_persistables(var):
if not fluid.io.is_persistable(var):
return False
if not os.path.exists(os.path.join(init_checkpoint_path, var.name)):
print("Var not exists: [%s]\t%s" % (var.name, os.path.join(init_checkpoint_path, var.name)))
# else:
# print ("Var exists: [%s]" % (var.name))
return os.path.exists(os.path.join(init_checkpoint_path, var.name))
fluid.io.load_vars(exe, init_checkpoint_path, main_program=main_program, predicate=existed_persistables)
log.info("Load model from {}".format(init_checkpoint_path))
def init_pretraining_params(exe, pretraining_params_path, main_program):
assert os.path.exists(pretraining_params_path), "[%s] cannot be found." % pretraining_params_path
def existed_params(var):
if not isinstance(var, fluid.framework.Parameter):
return False
if not os.path.exists(os.path.join(pretraining_params_path, var.name)):
print("Var not exists: [%s]\t%s" % (var.name, os.path.join(pretraining_params_path, var.name)))
# else:
# print ("Var exists: [%s]" % (var.name))
return os.path.exists(os.path.join(pretraining_params_path, var.name))
fluid.io.load_vars(exe, pretraining_params_path, main_program=main_program, predicate=existed_params)
log.info("Load pretraining parameters from {}.".format(pretraining_params_path))
# Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
export CUDA_VISIBLE_DEVICES=0
QUESTION=$1
# Example question: NFC咋开门 ("How do I open the door with NFC?")
if [ $# != 1 ];then
echo "USAGE: sh script/run_cross_encoder_test.sh \$QUESTION"
exit 1
fi
# compute scores for QUESTION and OCR parsing results with Rerank module
cd Rerank
bash run_test.sh ${QUESTION}
cd ..
# extract the answer for QUESTION from the top-1 ranked passage
cd Extraction
bash run_test.sh ${QUESTION}
cd ..
[ERNIE-Layout](../../../model_zoo/ernie-layout)
Simplified Chinese | [English](README_en.md)
# Information Extraction Application
**Table of Contents**
- [1. Overview](#1)
- [2. Features](#2)
- [2.1 Full Coverage of Extraction Scenarios](#21)
- [2.2 A Strong Training Foundation](#22)
- [2.3 An Industrial-Grade End-to-End Solution](#23)
- [2.4 Demo](#24)
- [3. Quick Start](#3)
- [3.1 Out-of-the-Box Use with Taskflow](#31)
- [3.2 Text Information Extraction](#32)
- [3.3 Document Information Extraction](#33)
<a name="1"></a>
## 1. Overview
The Information Extraction application open-sources industrial-grade solutions for a series of high-frequency information extraction scenarios. It features **multi-domain, multi-task, and cross-modal capabilities** and covers the **full pipeline of data labeling, model training, model tuning, and deployment**, enabling rapid productization of information extraction.
Plainly put, information extraction is the process of extracting structured information from input data such as text or images. Applying it in practice commonly runs into shifting domains, diverse tasks, and scarce data. To address these pain points, the PaddleNLP Information Extraction application, **built on the unified modeling idea of UIE**, provides an industrial-grade solution that supports **not only entity, relation, event, and opinion extraction from plain text, but also end-to-end extraction from documents, images, and tables**. The application **is not tied to any industry domain or extraction target**, bridging product prototyping, business POC, production rollout, and iteration, and helping developers quickly adapt extraction to their own domains.
**Highlights of the Information Extraction application:**
- **Comprehensive scenario coverage 🎓:** Covers all mainstream information extraction tasks for plain-text and document scenarios and supports multiple languages, meeting diverse deployment needs.
- **Leading performance 🏃:** Built on the UIE model series, which performs strongly on both plain text and multimodal data, with pretrained models in multiple sizes and a broad record of mature practical use.
- **Easy to use ⚡:** Three lines of code with Taskflow enable quick inference without any labeled data, a single command starts training, and deployment is straightforward, lowering the barrier to applying information extraction.
- **Efficient tuning ✊:** Developers can get started with data labeling and model training without a machine learning background.
<a name="2"></a>
## 2. Features
<a name="21"></a>
### 2.1 Full Coverage of Extraction Scenarios
Multiple models to choose from, balancing accuracy and speed for different information extraction scenarios.
| Model Name | Usage Scenarios | Supported Tasks |
| :----------------------------------------------------------: | :--------------------------------------------------------- | :--------------------------------------------------- |
| `uie-base`<br />`uie-medium`<br />`uie-mini`<br />`uie-micro`<br />`uie-nano` | **Extractive** models for **plain-text** scenarios, supporting **Chinese** | General-purpose extraction of entities, relations, events, and opinions |
| `uie-base-en` | An **extractive** model for **plain-text** scenarios, supporting **English** | General-purpose extraction of entities, relations, events, and opinions |
| `uie-m-base`<br />`uie-m-large` | **Extractive** models for **plain-text** scenarios, supporting **Chinese and English** | General-purpose extraction of entities, relations, events, and opinions |
| <b>`uie-x-base`</b> | An **extractive** model for **plain-text** and **document** scenarios, supporting **Chinese and English** | All plain-text capabilities, plus end-to-end extraction from documents/images/tables |
<a name="22"></a>
### 2.2 A Strong Training Foundation
The application uses the lightweight ERNIE 3.0 models as pretrained backbones and further pretrains them on a large amount of information extraction data, adapting the models to a fixed prompt.
- Results on Chinese text datasets
We ran experiments on in-house test sets for three verticals: internet, healthcare, and finance:
<table>
<tr><th rowspan='2'><th colspan='2'>Finance<th colspan='2'>Healthcare<th colspan='2'>Internet
<tr><td><th>0-shot<th>5-shot<th>0-shot<th>5-shot<th>0-shot<th>5-shot
<tr><td>uie-base (12L768H)<td>46.43<td>70.92<td><b>71.83</b><td>85.72<td>78.33<td>81.86
<tr><td>uie-medium (6L768H)<td>41.11<td>64.53<td>65.40<td>75.72<td>78.32<td>79.68
<tr><td>uie-mini (6L384H)<td>37.04<td>64.65<td>60.50<td>78.36<td>72.09<td>76.38
<tr><td>uie-micro (4L384H)<td>37.53<td>62.11<td>57.04<td>75.92<td>66.00<td>70.22
<tr><td>uie-nano (4L312H)<td>38.94<td>66.83<td>48.29<td>76.74<td>62.86<td>72.35
<tr><td>uie-m-large (24L1024H)<td><b>49.35</b><td><b>74.55</b><td>70.50<td><b>92.66</b><td>78.49<td><b>83.02</b>
<tr><td>uie-m-base (12L768H)<td>38.46<td>74.31<td>63.37<td>87.32<td>76.27<td>80.13
<tr><td>🧾 🎓<b>uie-x-base (12L768H)</b><td>48.84<td>73.87<td>65.60<td>88.81<td><b>79.36</b><td>81.65
</table>
0-shot means direct prediction through ```paddlenlp.Taskflow``` without training data; 5-shot means fine-tuning with 5 labeled examples per category. **The experiments show that UIE can further improve performance in vertical scenarios with a small amount of data (few-shot).**
- Results on multimodal datasets
We evaluated the zero-shot performance of UIE-X on in-house multimodal test sets in three domains: general, finance, and healthcare:
<table>
<tr><th ><th>General<th>Finance<th>Healthcare
<tr><td>🧾 🎓<b>uie-x-base (12L768H)</b><td>65.03<td>73.51<td>84.24
</table>
The general test set contains complex samples from different fields and is the most difficult of the three.
<a name="23"></a>
### 2.3 An Industrial-Grade End-to-End Solution
**Research stage**
- At this stage the target requirements are open-ended and little data has accumulated. Taskflow offers a three-line calling style that lets you validate results on your business scenario quickly, without any labeled data.
- [Text Extraction Taskflow User Guide](./taskflow_text.md)
- [Document Extraction Taskflow User Guide](./taskflow_doc.md)
**Data preparation stage**
- We recommend customizing an information extraction model for your own business scenario. We provide Label Studio annotation solutions for different extraction scenarios that connect data labeling directly to training data construction, greatly reducing the time cost of labeling and model customization.
- [Text Extraction Labeling Guide](./label_studio_text.md)
- [Document Extraction Labeling Guide](./label_studio_doc.md)
**Model fine-tuning and closed-domain distillation**
- UIE's strong few-shot fine-tuning ability enables low-cost model customization. A closed-domain distillation solution is also provided to address slow extraction speed.
- [End-to-end text information extraction example](./text/README.md)
- [End-to-end document information extraction example](./document/README.md)
**Model deployment**
- An HTTP deployment solution for quickly putting customized models into production.
- [Text Extraction HTTP Deployment Guide](./text/deploy/simple_serving/README.md)
- [Document Extraction HTTP Deployment Guide](./document/deploy/simple_serving/README.md)
<a name="24"></a>
### 2.4 Demo
- 🧾 Try UIE-X on its [Hugging Face Space](https://huggingface.co/spaces/PaddlePaddle/UIE-X):
<div align="center">
<img src=https://user-images.githubusercontent.com/40840292/207856955-a01cd5dd-fd5c-48ae-b8fd-c69512a88845.png height=500 width=900 hspace='10'/>
</div>
- Industrial examples of UIE-X end-to-end document extraction
- Customs declaration
<div align="center">
<img src=https://user-images.githubusercontent.com/40840292/205879840-239ada90-1692-40e4-a17f-c5e963fdd204.png height=800 width=500 />
</div>
- Delivery note (fine-tuning required)
<div align="center">
<img src=https://user-images.githubusercontent.com/40840292/205922422-f2615050-83cb-4bf5-8887-461f5633e85c.png height=250 width=700 />
</div>
- VAT invoice (fine-tuning required)
<div align="center">
<img src=https://user-images.githubusercontent.com/40840292/206084942-44ba477c-9244-4ce2-bbb5-ba430c9b926e.png height=550 width=700 />
</div>
- Form (fine-tuning required)
<div align="center">
<img src=https://user-images.githubusercontent.com/40840292/207856330-7aa0d158-47e0-477f-a88f-e23a040504a3.png height=400 width=700 />
</div>
<a name="3"></a>
## 3. Quick Start
<a name="31"></a>
### 3.1 Out-of-the-Box Use with Taskflow
- Get started immediately with Taskflow
👉 [Text Extraction Taskflow User Guide](./taskflow_text.md)
👉 [Document Extraction Taskflow User Guide](./taskflow_doc.md)
<a name="32"></a>
### 3.2 Text Information Extraction
- Quickly start text information extraction 👉 [Text Information Extraction Guide](./text/README.md)
<a name="33"></a>
### 3.3 Document Information Extraction
- Quickly start document information extraction 👉 [Document Information Extraction Guide](./document/README.md)
# Information Extraction Application
**Table of contents**
- [1. Introduction](#1)
- [2. Features](#2)
- [2.1 Available Models](#21)
- [2.2 Performance](#22)
- [2.3 Full Development Lifecycle](#23)
- [2.4 Demo](#24)
- [3. Quick Start](#3)
- [3.1 Taskflow](#31)
- [3.2 Text Information Extraction](#32)
- [3.3 Document Information Extraction](#33)
<a name="1"></a>
## 1. Introduction
This Information Extraction (IE) guide introduces our open-source industry-grade solution that covers the most widely-used application scenarios of Information Extraction. It features **multi-domain, multi-task, and cross-modal capabilities** and goes through the full lifecycle of **data labeling, model training and model deployment**. We hope this guide can help you apply Information Extraction techniques in your own products or models.
Information Extraction (IE) is the process of extracting structured information from given input data such as text, pictures or scanned documents. While IE brings immense value, applying IE techniques is never easy, with challenges such as domain adaptation, heterogeneous structures, lack of labeled data, etc. This PaddleNLP Information Extraction Guide builds on the foundation of our work in [Universal Information Extraction](https://arxiv.org/abs/2203.12277) and provides an industrial-level solution that not only supports **extracting entities, relations, events and opinions from plain text**, but also supports **cross-modal extraction out of documents, tables and pictures.** Our method features a flexible prompt, which allows you to specify extraction targets with simple natural language. We also provide a few domain-adapted models specialized for different industry sectors.
**Highlights:**
- **Comprehensive Coverage🎓:** Covers various mainstream tasks of information extraction for plain text and document scenarios, supports multiple languages
- **State-of-the-Art Performance🏃:** Strong performance from the UIE model series on plain-text and multimodal datasets. We also provide pretrained models of various sizes to meet different needs
- **Easy to use⚡:** Three lines of code with our `Taskflow` give you out-of-the-box Information Extraction (see the sketch below); one command launches model training and deployment
- **Efficient Tuning✊:** Developers can easily get started with the data labeling and model training process without a background in Machine Learning.
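As a quick illustration, here is a minimal sketch of that three-line call. The schema and input sentence are illustrative assumptions; `uie-base-en` is the English plain-text model from the table below:
```python
from pprint import pprint
from paddlenlp import Taskflow

# Define extraction targets in natural language, then call the pipeline.
schema = ["person", "organization", "time"]  # illustrative extraction targets
ie = Taskflow("information_extraction", schema=schema, model="uie-base-en")
pprint(ie("In 1997, Steve Jobs returned to Apple as interim CEO."))  # illustrative input
```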
<a name="2"></a>
## 2. Features
<a name="21"></a>
### 2.1 Available Models
Multiple models to choose from, balancing accuracy and speed, to fit different information extraction scenarios.
| Model Name | Usage Scenarios | Supporting Tasks |
| :----------------------------------------------------------: | :--------------------------------------------------------- | :--------------------------------------------------- |
| `uie-base`<br />`uie-medium`<br />`uie-mini`<br />`uie-micro`<br />`uie-nano` | **Extractive** models for **plain text** scenarios, support **Chinese** | Supports entity, relation, event, opinion extraction |
| `uie-base-en` | An **extractive** model for **plain text** scenarios, supports **English** | Supports entity, relation, event, opinion extraction |
| `uie-m-base`<br />`uie-m-large` | An **extractive** model for **plain text** scenarios, supporting **Chinese and English** | Supports entity, relation, event, opinion extraction |
| <b>`uie-x-base`</b> | An **extractive** model for **plain text** and **document** scenarios, supports **Chinese and English** | Supports entity, relation, event, opinion extraction on both plain text and documents/pictures/tables |
<a name="22"></a>
### 2.2 Performance
The UIE model series uses the lightweight ERNIE 3.0 models as its pretrained language models and was further finetuned on a large amount of information extraction data so that the models adapt to a fixed prompt.
- Experimental results on Chinese dataset
We conducted experiments on in-house test sets from the three domains of internet, healthcare, and finance:
<table>
<tr><th rowspan='2'><th colspan='2'>Finance<th colspan='2'>Healthcare<th colspan='2'>Internet
<tr><td><th>0-shot<th>5-shot<th>0-shot<th>5-shot<th>0-shot<th>5-shot
<tr><td>uie-base (12L768H)<td>46.43<td>70.92<td><b>71.83</b><td>85.72<td>78.33<td>81.86
<tr><td>uie-medium (6L768H)<td>41.11<td>64.53<td>65.40<td>75.72<td>78.32<td>79.68
<tr><td>uie-mini (6L384H)<td>37.04<td>64.65<td>60.50<td>78.36<td>72.09<td>76.38
<tr><td>uie-micro (4L384H)<td>37.53<td>62.11<td>57.04<td>75.92<td>66.00<td>70.22
<tr><td>uie-nano (4L312H)<td>38.94<td>66.83<td>48.29<td>76.74<td>62.86<td>72.35
<tr><td>uie-m-large (24L1024H)<td><b>49.35</b><td><b>74.55</b><td>70.50<td><b>92.66</b><td>78.49<td><b>83.02</b>
<tr><td>uie-m-base (12L768H)<td>38.46<td>74.31<td>63.37<td>87.32<td>76.27<td>80.13
<tr><td>🧾🎓<b>uie-x-base (12L768H)</b><td>48.84<td>73.87<td>65.60<td>88.81<td><b>79.36</b><td>81.65
</table>
0-shot means direct prediction through ```paddlenlp.Taskflow``` without any training data; 5-shot means each category has 5 labeled examples for model fine-tuning. **Experiments show that UIE can further improve performance with a small amount of data (few-shot)**.
- Experimental results on multimodal datasets
We evaluated the zero-shot performance of UIE-X on in-house multimodal test sets in three domains: general, financial, and medical:
<table>
<tr><th ><th>General<th>Financial<th>Medical
<tr><td>🧾🎓<b>uie-x-base (12L768H)</b><td>65.03<td>73.51<td>84.24
</table>
The general test set contains complex samples from different fields and is the most difficult task.
<a name="23"></a>
### 2.3 Full Development Lifecycle
**Research stage**
- At this stage, the target requirements are open and there is no labeled data. We provide a simple way of using Taskflow out of the box with three lines of code, which allows you to build a POC without any labeled data.
- [Text Extraction Taskflow User Guide](./taskflow_text_en.md)
- [Document Extraction Taskflow User Guide](./taskflow_doc_en.md)
**Data preparation stage**
- We recommend finetuning your own information extraction model for your use case. We provide Label Studio labeling solutions for different extraction scenarios. Based on this solution, the seamless connection from data labeling to training data construction can be realized, which greatly reduces the time cost of data labeling and model customization.
- [Text Extraction Labeling Guide](./label_studio_text_en.md)
- [Document Extraction Labeling Guide](./label_studio_doc_en.md)
**Model fine-tuning and closed domain distillation**
- UIE's strong few-shot ability enables low-cost model customization and adaptation. We also provide a closed-domain distillation solution to address slow extraction speed.
- [End-to-end text information extraction example](./text/README_en.md)
- [End-to-end document information extraction example](./document/README_en.md)
**Model Deployment**
- We provide an HTTP deployment solution to quickly put customized models into production.
- [Text Extraction HTTP Deployment Guide](./text/deploy/simple_serving/README_en.md)
- [Document Extraction HTTP Deployment Guide](./document/deploy/simple_serving/README_en.md)
<a name="24"></a>
### 2.4 Demo
- 🧾 Try our UIE-X demo on [🤗 HuggingFace Space](https://huggingface.co/spaces/PaddlePaddle/UIE-X):
<div align="center">
<img src=https://user-images.githubusercontent.com/40840292/207856955-a01cd5dd-fd5c-48ae-b8fd-c69512a88845.png height=500 width=900 hspace='10'/>
</div>
- UIE-X end-to-end document extraction industry application example
- Customs declaration
<div align="center">
<img src=https://user-images.githubusercontent.com/40840292/205879840-239ada90-1692-40e4-a17f-c5e963fdd204.png height=800 width=500 />
</div>
- Delivery Note (Need fine-tuning)
<div align="center">
<img src=https://user-images.githubusercontent.com/40840292/205922422-f2615050-83cb-4bf5-8887-461f5633e85c.png height=250 width=700 />
</div>
- VAT invoice (need fine-tuning)
<div align="center">
<img src=https://user-images.githubusercontent.com/40840292/206084942-44ba477c-9244-4ce2-bbb5-ba430c9b926e.png height=550 width=700 />
</div>
- Form (need fine-tuning)
<div align="center">
<img src=https://user-images.githubusercontent.com/40840292/207856330-7aa0d158-47e0-477f-a88f-e23a040504a3.png height=400 width=700 />
</div>
<a name="3"></a>
## 3. Quick Start
<a name="31"></a>
### 3.1 Taskflow
- Out of the box with Taskflow
👉 [Text Extraction Taskflow User Guide](./taskflow_text_en.md)
👉 [Document Extraction Taskflow User Guide](./taskflow_doc_en.md)
<a name="32"></a>
### 3.2 Text Information Extraction
- Quickly start text information extraction 👉 [Text Information Extraction Guide](./text/README_en.md)
<a name="33"></a>
### 3.3 Document Information Extraction
- Quickly start document information extraction 👉 [Document Information Extraction Guide](./document/README_en.md)
Simplified Chinese | [English](README_en.md)
# Document Information Extraction
**Table of Contents**
- [1. The Document Information Extraction Application](#1)
- [2. Quick Start](#2)
- [2.1 Code Structure](#代码结构)
- [2.2 Data Annotation](#数据标注)
- [2.3 Model Fine-Tuning](#模型微调)
- [2.4 Model Evaluation](#模型评估)
- [2.5 One-Click Prediction with a Custom Model](#定制模型一键预测)
- [2.6 Benchmark Results](#实验指标)
<a name="1"></a>
## 1. The Document Information Extraction Application
This project provides an end-to-end document extraction solution based on UIE fine-tuning, covering the **full pipeline of data labeling, model training, model tuning, and deployment** for rapid productization of document information extraction.
Plainly put, information extraction is the process of extracting structured information from input data such as text or images. Applying it in practice commonly runs into shifting domains, diverse tasks, and scarce data. To address these pain points, the PaddleNLP Information Extraction application, built on the unified modeling idea of UIE, provides an industrial-grade document information extraction solution that supports **entity, relation, event, and opinion extraction from documents/images/tables as well as plain text**. The application **is not tied to any industry domain or extraction target**, bridging product prototyping, business POC, production rollout, and iteration, and helping developers quickly adapt extraction to their own domains.
**Highlights of the document information extraction application:**
- **Comprehensive scenario coverage 🎓:** Covers all mainstream document information extraction tasks and supports multiple languages, meeting diverse deployment needs.
- **Leading performance 🏃:** Built on UIE-X, which excels at multimodal information extraction and has a broad record of mature practical use.
- **Easy to use ⚡:** Three lines of code with Taskflow enable quick inference without labeled data, a single command starts training, and deployment is straightforward, lowering the barrier to applying information extraction.
- **Efficient tuning ✊:** Developers can get started with data labeling and model training without a machine learning background.
<a name="2"></a>
## 2. Quick Start
For simple extraction targets, you can use ```paddlenlp.Taskflow``` directly for zero-shot extraction (a minimal sketch follows below); for fine-grained scenarios we recommend customization, i.e., labeling a small amount of data for fine-tuning, to further improve performance.
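A zero-shot sketch, assuming the VAT invoice image path and the schema fields from the fine-tuning example later in this guide:
```python
from pprint import pprint
from paddlenlp import Taskflow

# Zero-shot document extraction: only a schema, no fine-tuned checkpoint.
schema = ["开票日期", "名称", "金额"]  # illustrative subset of the VAT invoice fields
ie = Taskflow("information_extraction", model="uie-x-base", schema=schema)
pprint(ie({"doc": "./data/images/b199.jpg"}))  # illustrative image path
```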
<a name="代码结构"></a>
### 2.1 Code Structure
```shell
.
├── deploy # deployment directory
│ └── simple_serving # serving deployment based on PaddleNLP SimpleServing
├── utils.py # data processing utilities
├── finetune.py # model fine-tuning and compression script
├── evaluate.py # model evaluation script
└── README.md
```
<a name="数据标注"></a>
### 2.2 Data Annotation
We recommend using [Label Studio](https://labelstud.io/) for annotating document information extraction data. This project connects annotation directly to training: data exported from Label Studio can be converted into the model input format with the [label_studio.py](../label_studio.py) script, making the handoff seamless. For details on the annotation workflow, see the [Label Studio Annotation Guide](../label_studio_doc.md).
Here we provide a pre-annotated `VAT invoice dataset`. You can download it with the commands below; we then show how to generate the train/dev/test files with the conversion script and fine-tune the UIE-X model.
Download the VAT invoice dataset:
```shell
wget https://paddlenlp.bj.bcebos.com/datasets/tax.tar.gz
tar -zxvf tax.tar.gz
mv tax data
rm tax.tar.gz
```
Generate the train/dev set files:
```shell
python ../label_studio.py \
--label_studio_file ./data/label_studio.json \
--save_dir ./data \
--splits 0.8 0.2 0 \
--task_type ext
```
Generate the train/dev set files using PP-Structure layout analysis to improve the ordering of OCR results:
```shell
python ../label_studio.py \
--label_studio_file ./data/label_studio.json \
--save_dir ./data \
--splits 0.8 0.2 0 \
--task_type ext \
--layout_analysis True
```
For annotation rules and parameter descriptions for other task types (entity extraction, relation extraction, document classification, etc.), see the [Label Studio Annotation Guide](../label_studio_doc.md).
<a name="模型微调"></a>
### 2.3 Model Fine-Tuning
We recommend fine-tuning with the [Trainer API](https://github.com/PaddlePaddle/PaddleNLP/blob/develop/docs/trainer.md). Given just a model and a dataset, the Trainer API efficiently handles pretraining, fine-tuning, and model compression, with built-in support for multi-GPU training, mixed precision, gradient accumulation, checkpoint resumption, logging, and more; it also wraps common training configuration such as optimizers and learning rate schedules.
Use the command below to fine-tune with `uie-x-base` as the pretrained model and save the fine-tuned model to `./checkpoint/model_best`.
Single-GPU launch:
```shell
python finetune.py \
--device gpu \
--logging_steps 5 \
--save_steps 25 \
--eval_steps 25 \
--seed 42 \
--model_name_or_path uie-x-base \
--output_dir ./checkpoint/model_best \
--train_path data/train.txt \
--dev_path data/dev.txt \
--max_seq_len 512 \
--per_device_train_batch_size 8 \
--per_device_eval_batch_size 8 \
--num_train_epochs 10 \
--learning_rate 1e-5 \
--do_train \
--do_eval \
--do_export \
--export_model_dir ./checkpoint/model_best \
--overwrite_output_dir \
--disable_tqdm True \
--metric_for_best_model eval_f1 \
--load_best_model_at_end True \
--save_total_limit 1
```
When running in a GPU environment, you can specify the `--gpus` flag for multi-GPU training:
```shell
python -u -m paddle.distributed.launch --gpus "0" finetune.py \
--device gpu \
--logging_steps 5 \
--save_steps 25 \
--eval_steps 25 \
--seed 42 \
--model_name_or_path uie-x-base \
--output_dir ./checkpoint/model_best \
--train_path data/train.txt \
--dev_path data/dev.txt \
--max_seq_len 512 \
--per_device_train_batch_size 8 \
--per_device_eval_batch_size 8 \
--num_train_epochs 10 \
--learning_rate 1e-5 \
--do_train \
--do_eval \
--do_export \
--export_model_dir ./checkpoint/model_best \
--overwrite_output_dir \
--disable_tqdm True \
--metric_for_best_model eval_f1 \
--load_best_model_at_end True \
--save_total_limit 1
```
Because the flag `--do_eval` is set in this example, evaluation runs automatically after training finishes.
Configurable parameters:
* `device`: Training device; one of 'cpu', 'gpu', or 'npu'. Defaults to GPU training.
* `logging_steps`: Interval in steps between log prints during training; defaults to 10.
* `save_steps`: Interval in steps between model checkpoint saves during training; defaults to 100.
* `eval_steps`: Interval in steps between evaluations during training; defaults to 100.
* `seed`: Global random seed; defaults to 42.
* `model_name_or_path`: Pretrained model used for few-shot training; defaults to "uie-x-base".
* `output_dir`: Required; directory where the trained or compressed model is saved. Defaults to `None`.
* `train_path`: Training set path; defaults to `None`.
* `dev_path`: Development set path; defaults to `None`.
* `max_seq_len`: Maximum text length per split; inputs longer than this are split automatically. Defaults to 512.
* `per_device_train_batch_size`: Batch size per GPU/NPU core or CPU for training; defaults to 8.
* `per_device_eval_batch_size`: Batch size per GPU/NPU core or CPU for evaluation; defaults to 8.
* `num_train_epochs`: Number of training epochs; 100 is a reasonable choice with early stopping. Defaults to 10.
* `learning_rate`: Maximum learning rate; 1e-5 is recommended for UIE-X. Defaults to 3e-5.
* `label_names`: Names of the training data labels; set to 'start_positions' 'end_positions' for UIE-X. Defaults to None.
* `do_train`: Whether to fine-tune; pass this flag to enable training. Not set by default.
* `do_eval`: Whether to evaluate; pass this flag to enable evaluation. Not set by default.
* `do_export`: Whether to export a static graph; pass this flag to enable export. Not set by default.
* `export_model_dir`: Directory for the exported static graph; defaults to None.
* `overwrite_output_dir`: If `True`, overwrite the contents of the output directory. If `output_dir` points to a checkpoint directory, it is used to resume training.
* `disable_tqdm`: Whether to disable the tqdm progress bar.
* `metric_for_best_model`: Metric used to select the best model; `eval_f1` is recommended for UIE-X. Defaults to None.
* `load_best_model_at_end`: Whether to load the best model after training ends; usually used together with `metric_for_best_model`. Defaults to False.
* `save_total_limit`: If set, limits the total number of checkpoints kept, deleting older checkpoints from `output_dir`. Defaults to None.
<a name="模型评估"></a>
### 2.4 Model Evaluation
```shell
python evaluate.py \
--device "gpu" \
--model_path ./checkpoint/model_best \
--test_path ./data/dev.txt \
--output_dir ./checkpoint/model_best \
--label_names 'start_positions' 'end_positions' \
--max_seq_len 512 \
--per_device_eval_batch_size 16
```
Evaluation notes: evaluation is single-stage, i.e., for tasks that predict in stages (relation extraction, event extraction, etc.), each stage's predictions are evaluated separately. By default, the dev/test set constructs all negatives from every label at the same level.
You can enable `debug` mode to evaluate each positive category separately; this mode is intended for model debugging only:
```shell
python evaluate.py \
--device "gpu" \
--model_path ./checkpoint/model_best \
--test_path ./data/dev.txt \
--output_dir ./checkpoint/model_best \
--label_names 'start_positions' 'end_positions' \
--max_seq_len 512 \
--per_device_eval_batch_size 16 \
--debug True
```
Output:
```text
[2022-11-14 09:41:18,424] [ INFO] - ***** Running Evaluation *****
[2022-11-14 09:41:18,424] [ INFO] - Num examples = 160
[2022-11-14 09:41:18,424] [ INFO] - Pre device batch size = 4
[2022-11-14 09:41:18,424] [ INFO] - Total Batch size = 4
[2022-11-14 09:41:18,424] [ INFO] - Total prediction steps = 40
[2022-11-14 09:41:26,451] [ INFO] - -----Evaluate model-------
[2022-11-14 09:41:26,451] [ INFO] - Class Name: ALL CLASSES
[2022-11-14 09:41:26,451] [ INFO] - Evaluation Precision: 0.94521 | Recall: 0.88462 | F1: 0.91391
[2022-11-14 09:41:26,451] [ INFO] - -----------------------------
[2022-11-14 09:41:26,452] [ INFO] - ***** Running Evaluation *****
[2022-11-14 09:41:26,452] [ INFO] - Num examples = 8
[2022-11-14 09:41:26,452] [ INFO] - Pre device batch size = 4
[2022-11-14 09:41:26,452] [ INFO] - Total Batch size = 4
[2022-11-14 09:41:26,452] [ INFO] - Total prediction steps = 2
[2022-11-14 09:41:26,692] [ INFO] - Class Name: 开票日期
[2022-11-14 09:41:26,692] [ INFO] - Evaluation Precision: 1.00000 | Recall: 1.00000 | F1: 1.00000
[2022-11-14 09:41:26,692] [ INFO] - -----------------------------
[2022-11-14 09:41:26,693] [ INFO] - ***** Running Evaluation *****
[2022-11-14 09:41:26,693] [ INFO] - Num examples = 8
[2022-11-14 09:41:26,693] [ INFO] - Pre device batch size = 4
[2022-11-14 09:41:26,693] [ INFO] - Total Batch size = 4
[2022-11-14 09:41:26,693] [ INFO] - Total prediction steps = 2
[2022-11-14 09:41:26,952] [ INFO] - Class Name: 名称
[2022-11-14 09:41:26,952] [ INFO] - Evaluation Precision: 0.87500 | Recall: 0.87500 | F1: 0.87500
[2022-11-14 09:41:26,952] [ INFO] - -----------------------------
...
```
Configurable parameters:
* `device`: Evaluation device; one of 'cpu', 'gpu', or 'npu'. Defaults to GPU evaluation.
* `model_path`: Path to the model directory to evaluate; it must contain the weights file `model_state.pdparams` and the config file `model_config.json`.
* `test_path`: Test set file used for evaluation.
* `label_names`: Names of the training data labels; set to 'start_positions' 'end_positions' for UIE-X. Defaults to None.
* `batch_size`: Batch size; adjust it to your machine. Defaults to 16.
* `max_seq_len`: Maximum text length per split; inputs longer than this are split automatically. Defaults to 512.
* `per_device_eval_batch_size`: Batch size per GPU/NPU core or CPU for evaluation; defaults to 8.
* `debug`: Whether to enable debug mode, which evaluates each positive category separately; for model debugging only. Off by default.
* `schema_lang`: Schema language, `ch` or `en`. Defaults to `ch`; use `en` for English datasets.
<a name="定制模型一键预测"></a>
### 2.5 One-Click Prediction with a Custom Model
Load the custom model with `paddlenlp.Taskflow` by pointing `task_path` at the directory containing the trained weights file `model_state.pdparams`:
```python
from pprint import pprint
from paddlenlp import Taskflow
from paddlenlp.utils.doc_parser import DocParser
schema = ['开票日期', '名称', '纳税人识别号', '开户行及账号', '金额', '价税合计', 'No', '税率', '地址、电话', '税额']
my_ie = Taskflow("information_extraction", model="uie-x-base", schema=schema, task_path='./checkpoint/model_best', precision='fp16')
```
Given the configured `schema`, we can run information extraction on the document at `doc_path` and visualize the result:
```python
doc_path = "./data/images/b199.jpg"
results = my_ie({"doc": doc_path})
pprint(results)
# Visualize the result
DocParser.write_image_with_results(
doc_path,
result=results[0],
save_path="./image_show.png")
```
<div align="center">
<img src=https://user-images.githubusercontent.com/40840292/206084942-44ba477c-9244-4ce2-bbb5-ba430c9b926e.png height=550 width=700 />
</div>
<a name="实验指标"></a>
### 2.6 Benchmark Results
We ran experiments on our self-annotated VAT invoice dataset:
| | Precision | Recall | F1 Score |
| :---: | :--------: | :--------: | :--------: |
| 0-shot| 0.44898 | 0.56410 | 0.50000 |
| 5-shot| 0.9000 | 0.9231 | 0.9114 |
| 10-shot| 0.9125 | 0.93590 | 0.9241 |
| 20-shot| 0.9737 | 0.9487 | 0.9610 |
| 30-shot| 0.9744 | 0.9744 | 0.9744 |
| 30-shot+PP-Structure| 1.0 | 0.9625 | 0.9809 |
n-shot means the training set contains n annotated images for fine-tuning. The experiments show that UIE-X can further improve results with a small amount of data (few-shot) combined with PP-Structure layout analysis.