Unverified Commit dd9d483d authored by Julien Chaumond, committed by GitHub

Trainer (#3800)

* doc

* [tests] Add sample files for a regression task

* [HUGE] Trainer

* Feedback from @sshleifer

* Feedback from @thomwolf + logging tweak

* [file_utils] when downloading concurrently, get_from_cache will use the cached file for subsequent processes

* [glue] Use default max_seq_length of 128 like before

* [glue] move DataTrainingArguments around

* [ner] Change interface of InputExample, and align run_{tf,pl}

* Re-align the pl scripts a little bit

* ner

* [ner] Add integration test

* Fix language_modeling with API tweak

* [ci] Tweak loss target

* Don't break console output

* amp.initialize: model must be on right device before

* [multiple-choice] update for Trainer

* Re-align to 827d6d6e
parent eb5601b0
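Since the diff below is long, here is a rough, hedged sketch of how the pieces introduced in this PR are meant to fit together: the `NerDataset` added in `examples/ner/utils_ner.py` feeding the new `Trainer`. The checkpoint name, paths, and exact keyword arguments are placeholders chosen for illustration; `run_ner.py` in this diff is the authoritative version.

```python
# Hedged sketch only -- see run_ner.py in this PR for the real script.
# Assumes utils_ner.py from this diff is importable and that train.txt /
# labels.txt were produced by the GermEval preprocessing steps shown below.
from transformers import (
    AutoConfig,
    AutoModelForTokenClassification,
    AutoTokenizer,
    Trainer,
    TrainingArguments,
)
from utils_ner import NerDataset, Split, get_labels

labels = get_labels("./labels.txt")
config = AutoConfig.from_pretrained("bert-base-multilingual-cased", num_labels=len(labels))
tokenizer = AutoTokenizer.from_pretrained("bert-base-multilingual-cased")
model = AutoModelForTokenClassification.from_pretrained("bert-base-multilingual-cased", config=config)

# The dataset derives what it needs from the tokenizer and the config's
# model_type, which is why the --model_type flags disappear throughout this diff.
train_dataset = NerDataset(
    data_dir=".",                  # assumed location of train.txt
    tokenizer=tokenizer,
    labels=labels,
    model_type=config.model_type,  # e.g. "bert"
    max_seq_length=128,
    mode=Split.train,
)

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="./germeval-model", num_train_epochs=3),
    train_dataset=train_dataset,
)
trainer.train()
trainer.save_model()
```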
@@ -130,6 +130,7 @@ proc_data
# examples
runs
/runs_old
examples/runs
# data
...
@@ -306,8 +306,9 @@ setup your environment to run the examples.
The library comprises several example scripts with SOTA performances for NLU and NLG tasks:
- `run_glue.py`: an example fine-tuning sequence classification models on nine different GLUE tasks (*sequence-level classification*)
- `run_squad.py`: an example fine-tuning question answering models on the question answering dataset SQuAD 2.0 (*token-level classification*)
- `run_ner.py`: an example fine-tuning token classification models on named entity recognition (*token-level classification*)
- `run_generation.py`: an example using GPT, GPT-2, CTRL, Transformer-XL and XLNet for conditional language generation
- other model-specific examples (see the documentation).
@@ -317,7 +318,7 @@ Here are three quick usage examples for these scripts:
The [General Language Understanding Evaluation (GLUE) benchmark](https://gluebenchmark.com/) is a collection of nine sentence- or sentence-pair language understanding tasks for evaluating and analyzing natural language understanding systems.
Before running any of these GLUE tasks you should download the
[GLUE data](https://gluebenchmark.com/tasks) by running
[this script](https://gist.github.com/W4ngatang/60c2bdb54d156a41194446737ce03e2e)
and unpack it to some directory `$GLUE_DIR`.
@@ -333,7 +334,6 @@ export GLUE_DIR=/path/to/glue
export TASK_NAME=MRPC
python ./examples/run_glue.py \
--model_type bert \
--model_name_or_path bert-base-uncased \
--task_name $TASK_NAME \
--do_train \
@@ -360,7 +360,6 @@ Parallel training is a simple way to use several GPUs (but is slower and less fl
export GLUE_DIR=/path/to/glue
python ./examples/run_glue.py \
--model_type xlnet \
--model_name_or_path xlnet-large-cased \
--do_train \
--do_eval \
@@ -386,7 +385,6 @@ This example code fine-tunes the Bert Whole Word Masking model on the Microsoft
```bash
python -m torch.distributed.launch --nproc_per_node 8 ./examples/run_glue.py \
--model_type bert \
--model_name_or_path bert-large-uncased-whole-word-masking \
--task_name MRPC \
--do_train \
...
@@ -246,7 +246,6 @@ and unpack it to some directory `$GLUE_DIR`.
export GLUE_DIR=/path/to/glue
python run_glue.py \
--model_type bert \
--model_name_or_path bert-base-cased \
--task_name MRPC \
--do_train \
@@ -272,7 +271,6 @@ Using Apex and 16 bit precision, the fine-tuning on MRPC only takes 27 seconds.
export GLUE_DIR=/path/to/glue
python run_glue.py \
--model_type bert \
--model_name_or_path bert-base-cased \
--task_name MRPC \
--do_train \
@@ -296,7 +294,6 @@ export GLUE_DIR=/path/to/glue
python -m torch.distributed.launch \
--nproc_per_node 8 run_glue.py \
--model_type bert \
--model_name_or_path bert-base-cased \
--task_name MRPC \
--do_train \
@@ -329,7 +326,6 @@ export GLUE_DIR=/path/to/glue
python -m torch.distributed.launch \
--nproc_per_node 8 run_glue.py \
--model_type bert \
--model_name_or_path bert-base-cased \
--task_name mnli \
--do_train \
@@ -369,7 +365,6 @@ Download [swag](https://github.com/rowanz/swagaf/tree/master/data) data
#training on 4 tesla V100(16GB) GPUS
export SWAG_DIR=/path/to/swag_data_dir
python ./examples/run_multiple_choice.py \
--model_type roberta \
--task_name swag \
--model_name_or_path roberta-base \
--do_train \
...
@@ -11,7 +11,6 @@ export DATA_DIR=./glue_data/MRPC/
export MAX_LENGTH=128
export LEARNING_RATE=2e-5
export BERT_MODEL=bert-base-cased
export MODEL_TYPE=bert
export BATCH_SIZE=32
export NUM_EPOCHS=3
export SEED=2
@@ -25,7 +24,6 @@ mkdir -p $OUTPUT_DIR
export PYTHONPATH="../":"${PYTHONPATH}"
python3 run_pl_glue.py --data_dir $DATA_DIR \
--model_type $MODEL_TYPE \
--task $TASK \
--model_name_or_path $BERT_MODEL \
--output_dir $OUTPUT_DIR \
...
@@ -35,8 +35,8 @@ class GLUETransformer(BaseTransformer):
def training_step(self, batch, batch_idx):
inputs = {"input_ids": batch[0], "attention_mask": batch[1], "labels": batch[3]}
if self.config.model_type != "distilbert":
inputs["token_type_ids"] = batch[2] if self.config.model_type in ["bert", "xlnet", "albert"] else None
outputs = self(**inputs)
loss = outputs[0]
@@ -95,8 +95,8 @@ class GLUETransformer(BaseTransformer):
def validation_step(self, batch, batch_idx):
inputs = {"input_ids": batch[0], "attention_mask": batch[1], "labels": batch[3]}
if self.config.model_type != "distilbert":
inputs["token_type_ids"] = batch[2] if self.config.model_type in ["bert", "xlnet", "albert"] else None
outputs = self(**inputs)
tmp_eval_loss, logits = outputs[:2]
@@ -179,7 +179,7 @@ if __name__ == "__main__":
# If output_dir not provided, a folder will be generated in pwd
if args.output_dir is None:
args.output_dir = os.path.join("./results", f"{args.task}_{time.strftime('%Y%m%d_%H%M%S')}",)
os.makedirs(args.output_dir)
model = GLUETransformer(args)
...
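A recurring change visible above: scripts stop carrying a `--model_type` flag (or `hparams.model_type`) and instead read `model_type` off the pretrained config. A tiny illustration of that lookup; the checkpoint names are arbitrary examples.

```python
# Minimal illustration of where model_type now comes from;
# checkpoint names are arbitrary examples.
from transformers import AutoConfig

config = AutoConfig.from_pretrained("bert-base-cased")
print(config.model_type)  # -> "bert", no --model_type flag needed

config = AutoConfig.from_pretrained("xlnet-large-cased")
print(config.model_type)  # -> "xlnet"
```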
@@ -64,7 +64,6 @@ To start training, just run:
```bash
python3 run_ner.py --data_dir ./ \
--model_type bert \
--labels ./labels.txt \
--model_name_or_path $BERT_MODEL \
--output_dir $OUTPUT_DIR \
@@ -125,7 +124,6 @@ To start training, just run:
```bash
python3 run_tf_ner.py --data_dir ./ \
--model_type bert \
--labels ./labels.txt \
--model_name_or_path $BERT_MODEL \
--output_dir $OUTPUT_DIR \
...
@@ -4,7 +4,7 @@ curl -L 'https://sites.google.com/site/germeval2014ner/data/NER-de-dev.tsv?attre
| grep -v "^#" | cut -f 2,3 | tr '\t' ' ' > dev.txt.tmp
curl -L 'https://sites.google.com/site/germeval2014ner/data/NER-de-test.tsv?attredirects=0&d=1' \
| grep -v "^#" | cut -f 2,3 | tr '\t' ' ' > test.txt.tmp
wget "https://raw.githubusercontent.com/stefan-it/fine-tuned-berts-seq/master/scripts/preprocess.py"
export MAX_LENGTH=128
export BERT_MODEL=bert-base-multilingual-cased
python3 preprocess.py train.txt.tmp $BERT_MODEL $MAX_LENGTH > train.txt
@@ -17,8 +17,8 @@ export NUM_EPOCHS=3
export SAVE_STEPS=750
export SEED=1
python3 run_ner.py \
--data_dir . \
--labels ./labels.txt \
--model_name_or_path $BERT_MODEL \
--output_dir $OUTPUT_DIR \
...
This diff is collapsed.
@@ -11,7 +11,7 @@ curl -L 'https://sites.google.com/site/germeval2014ner/data/NER-de-dev.tsv?attre
| grep -v "^#" | cut -f 2,3 | tr '\t' ' ' > dev.txt.tmp
curl -L 'https://sites.google.com/site/germeval2014ner/data/NER-de-test.tsv?attredirects=0&d=1' \
| grep -v "^#" | cut -f 2,3 | tr '\t' ' ' > test.txt.tmp
wget "https://raw.githubusercontent.com/stefan-it/fine-tuned-berts-seq/master/scripts/preprocess.py"
export MAX_LENGTH=128
export BERT_MODEL=bert-base-multilingual-cased
python3 preprocess.py train.txt.tmp $BERT_MODEL $MAX_LENGTH > train.txt
...
@@ -27,7 +27,7 @@ class NERTransformer(BaseTransformer):
self.labels = get_labels(hparams.labels)
num_labels = len(self.labels)
self.pad_token_label_id = CrossEntropyLoss().ignore_index
super().__init__(hparams, num_labels, self.mode)
def forward(self, **inputs):
return self.model(**inputs)
@@ -35,10 +35,10 @@ class NERTransformer(BaseTransformer):
def training_step(self, batch, batch_num):
"Compute loss and log."
inputs = {"input_ids": batch[0], "attention_mask": batch[1], "labels": batch[3]}
if self.config.model_type != "distilbert":
inputs["token_type_ids"] = (
batch[2] if self.config.model_type in ["bert", "xlnet"] else None
) # XLM and RoBERTa don"t use token_type_ids
outputs = self(**inputs)
loss = outputs[0]
@@ -58,12 +58,12 @@ class NERTransformer(BaseTransformer):
self.labels,
args.max_seq_length,
self.tokenizer,
cls_token_at_end=bool(self.config.model_type in ["xlnet"]),
cls_token=self.tokenizer.cls_token,
cls_token_segment_id=2 if self.config.model_type in ["xlnet"] else 0,
sep_token=self.tokenizer.sep_token,
sep_token_extra=bool(self.config.model_type in ["roberta"]),
pad_on_left=bool(self.config.model_type in ["xlnet"]),
pad_token=self.tokenizer.pad_token_id,
pad_token_segment_id=self.tokenizer.pad_token_type_id,
pad_token_label_id=self.pad_token_label_id,
@@ -77,21 +77,25 @@ class NERTransformer(BaseTransformer):
logger.info("Loading features from cached file %s", cached_features_file)
features = torch.load(cached_features_file)
all_input_ids = torch.tensor([f.input_ids for f in features], dtype=torch.long)
all_attention_mask = torch.tensor([f.attention_mask for f in features], dtype=torch.long)
if features[0].token_type_ids is not None:
all_token_type_ids = torch.tensor([f.token_type_ids for f in features], dtype=torch.long)
else:
all_token_type_ids = torch.tensor([0 for f in features], dtype=torch.long)
# HACK(we will not use this anymore soon)
all_label_ids = torch.tensor([f.label_ids for f in features], dtype=torch.long)
return DataLoader(
TensorDataset(all_input_ids, all_attention_mask, all_token_type_ids, all_label_ids), batch_size=batch_size
)
def validation_step(self, batch, batch_nb):
"Compute validation"
inputs = {"input_ids": batch[0], "attention_mask": batch[1], "labels": batch[3]}
if self.config.model_type != "distilbert":
inputs["token_type_ids"] = (
batch[2] if self.config.model_type in ["bert", "xlnet"] else None
) # XLM and RoBERTa don"t use token_type_ids
outputs = self(**inputs)
tmp_eval_loss, logits = outputs[:2]
preds = logits.detach().cpu().numpy()
...
@@ -9,6 +9,7 @@ import re
import numpy as np
import tensorflow as tf
from absl import app, flags, logging
from fastprogress import master_bar, progress_bar
from seqeval import metrics
from transformers import (
@@ -17,34 +18,23 @@ from transformers import (
AutoConfig,
AutoTokenizer,
GradientAccumulator,
PreTrainedTokenizer,
TFAutoModelForTokenClassification,
create_optimizer,
)
from utils_ner import convert_examples_to_features, get_labels, read_examples_from_file
try:
from fastprogress import master_bar, progress_bar
except ImportError:
from fastprogress.fastprogress import master_bar, progress_bar
MODEL_CONFIG_CLASSES = list(TF_MODEL_FOR_TOKEN_CLASSIFICATION_MAPPING.keys())
MODEL_TYPES = tuple(conf.model_type for conf in MODEL_CONFIG_CLASSES)
ALL_MODELS = sum((tuple(conf.pretrained_config_archive_map.keys()) for conf in MODEL_CONFIG_CLASSES), (),)
flags.DEFINE_string(
"data_dir", None, "The input data dir. Should contain the .conll files (or other data files) for the task."
)
flags.DEFINE_string("model_type", None, "Model type selected in the list: " + ", ".join(MODEL_TYPES))
flags.DEFINE_string(
"model_name_or_path", None, "Path to pretrained model or model identifier from huggingface.co/models",
)
flags.DEFINE_string("output_dir", None, "The output directory where the model checkpoints will be written.")
@@ -53,11 +43,11 @@ flags.DEFINE_string(
"labels", "", "Path to a file containing all labels. If not specified, CoNLL-2003 labels are used."
)
flags.DEFINE_string("config_name", None, "Pretrained config name or path if not the same as model_name")
flags.DEFINE_string("tokenizer_name", None, "Pretrained tokenizer name or path if not the same as model_name")
flags.DEFINE_string("cache_dir", None, "Where do you want to store the pre-trained models downloaded from s3")
flags.DEFINE_integer(
"max_seq_length",
@@ -123,7 +113,7 @@ flags.DEFINE_boolean(
"Evaluate all checkpoints starting with the same prefix as model_name ending and ending with step number",
)
flags.DEFINE_boolean("no_cuda", False, "Avoid using CUDA even if it is available")
flags.DEFINE_boolean("overwrite_output_dir", False, "Overwrite the content of the output directory")
@@ -198,12 +188,10 @@ def train(
@tf.function
def train_step(train_features, train_labels):
def step_fn(train_features, train_labels):
inputs = {"attention_mask": train_features["attention_mask"], "training": True}
if "token_type_ids" in train_features:
inputs["token_type_ids"] = train_features["token_type_ids"]
with tf.GradientTape() as tape:
logits = model(train_features["input_ids"], **inputs)[0]
@@ -320,12 +308,10 @@ def evaluate(args, strategy, model, tokenizer, labels, pad_token_label_id, mode)
logging.info(" Batch size = %d", eval_batch_size)
for eval_features, eval_labels in eval_iterator:
inputs = {"attention_mask": eval_features["attention_mask"], "training": False}
if "token_type_ids" in eval_features:
inputs["token_type_ids"] = eval_features["token_type_ids"]
with strategy.scope():
logits = model(eval_features["input_ids"], **inputs)[0]
@@ -356,20 +342,23 @@ def evaluate(args, strategy, model, tokenizer, labels, pad_token_label_id, mode)
return y_true, y_pred, loss.numpy()
def load_cache(cached_file, tokenizer: PreTrainedTokenizer, max_seq_length):
name_to_features = {
"input_ids": tf.io.FixedLenFeature([max_seq_length], tf.int64),
"attention_mask": tf.io.FixedLenFeature([max_seq_length], tf.int64),
"label_ids": tf.io.FixedLenFeature([max_seq_length], tf.int64),
}
# TODO Find a cleaner way to do this.
if "token_type_ids" in tokenizer.model_input_names:
name_to_features["token_type_ids"] = tf.io.FixedLenFeature([max_seq_length], tf.int64)
def _decode_record(record):
example = tf.io.parse_single_example(record, name_to_features)
features = {}
features["input_ids"] = example["input_ids"]
features["attention_mask"] = example["attention_mask"]
if "token_type_ids" in example:
features["token_type_ids"] = example["token_type_ids"]
return features, example["label_ids"]
@@ -393,8 +382,9 @@ def save_cache(features, cached_features_file):
record_feature = collections.OrderedDict()
record_feature["input_ids"] = create_int_feature(feature.input_ids)
record_feature["attention_mask"] = create_int_feature(feature.attention_mask)
if feature.token_type_ids is not None:
record_feature["token_type_ids"] = create_int_feature(feature.token_type_ids)
record_feature["label_ids"] = create_int_feature(feature.label_ids)
tf_example = tf.train.Example(features=tf.train.Features(feature=record_feature))
@@ -410,13 +400,11 @@ def load_and_cache_examples(args, tokenizer, labels, pad_token_label_id, batch_s
# Load data features from cache or dataset file
cached_features_file = os.path.join(
args["data_dir"],
"cached_{}_{}_{}.tf_record".format(mode, tokenizer.__class__.__name__, str(args["max_seq_length"])),
)
if os.path.exists(cached_features_file) and not args["overwrite_cache"]:
logging.info("Loading features from cached file %s", cached_features_file)
dataset, size = load_cache(cached_features_file, tokenizer, args["max_seq_length"])
else:
logging.info("Creating features from dataset file at %s", args["data_dir"])
examples = read_examples_from_file(args["data_dir"], mode)
@@ -440,7 +428,7 @@ def load_and_cache_examples(args, tokenizer, labels, pad_token_label_id, batch_s
)
logging.info("Saving features into cached file %s", cached_features_file)
save_cache(features, cached_features_file)
dataset, size = load_cache(cached_features_file, tokenizer, args["max_seq_length"])
if mode == "train":
dataset = dataset.repeat()
@@ -500,17 +488,18 @@ def main(_):
config = AutoConfig.from_pretrained(
args["config_name"] if args["config_name"] else args["model_name_or_path"],
num_labels=num_labels,
cache_dir=args["cache_dir"],
)
logging.info("Training/evaluation parameters %s", args)
args["model_type"] = config.model_type
# Training
if args["do_train"]:
tokenizer = AutoTokenizer.from_pretrained(
args["tokenizer_name"] if args["tokenizer_name"] else args["model_name_or_path"],
do_lower_case=args["do_lower_case"],
cache_dir=args["cache_dir"],
)
with strategy.scope():
@@ -518,7 +507,7 @@ def main(_):
args["model_name_or_path"],
from_pt=bool(".bin" in args["model_name_or_path"]),
config=config,
cache_dir=args["cache_dir"],
)
train_batch_size = args["per_device_train_batch_size"] * args["n_device"]
@@ -538,8 +527,7 @@ def main(_):
pad_token_label_id,
)
os.makedirs(args["output_dir"], exist_ok=True)
logging.info("Saving model to %s", args["output_dir"])
@@ -637,5 +625,4 @@ if __name__ == "__main__":
flags.mark_flag_as_required("data_dir")
flags.mark_flag_as_required("output_dir")
flags.mark_flag_as_required("model_name_or_path")
flags.mark_flag_as_required("model_type")
app.run(main)
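The TF script above now decides whether to feed `token_type_ids` by checking what the tokenizer actually produces (`tokenizer.model_input_names`, or the presence of the key in the features) instead of switching on `model_type`. A quick hedged check of that property; the checkpoint names are picked only as examples.

```python
# Which tokenizers emit token_type_ids? Checkpoint names are examples only.
from transformers import AutoTokenizer

for name in ["bert-base-cased", "distilbert-base-uncased", "roberta-base"]:
    tok = AutoTokenizer.from_pretrained(name)
    print(name, "token_type_ids" in tok.model_input_names)
# Expected roughly: BERT -> True, DistilBERT -> False; RoBERTa's tokenizer may
# still list token_type_ids even though the model ignores them.
```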
import logging
import sys
import unittest
from unittest.mock import patch
import run_ner
logging.basicConfig(level=logging.DEBUG)
logger = logging.getLogger()
class ExamplesTests(unittest.TestCase):
def test_run_ner(self):
stream_handler = logging.StreamHandler(sys.stdout)
logger.addHandler(stream_handler)
testargs = """
--model_name distilbert-base-german-cased
--output_dir ./examples/tests_samples/temp_dir
--overwrite_output_dir
--data_dir ./examples/tests_samples/GermEval
--labels ./examples/tests_samples/GermEval/labels.txt
--max_seq_length 128
--num_train_epochs 6
--logging_steps 1
--do_train
--do_eval
""".split()
with patch.object(sys, "argv", ["run.py"] + testargs):
result = run_ner.main()
self.assertLess(result["loss"], 1.5)
@@ -18,40 +18,126 @@
import logging
import os
from dataclasses import dataclass
from enum import Enum
from typing import List, Optional, Union
import torch
from torch import nn
from torch.utils.data.dataset import Dataset
from transformers import PreTrainedTokenizer, torch_distributed_zero_first
logger = logging.getLogger(__name__)
@dataclass
class InputExample:
"""
A single training/test example for token classification.
Args:
guid: Unique id for the example.
words: list. The words of the sequence.
labels: (Optional) list. The labels for each word of the sequence. This should be
specified for train and dev examples, but not for test examples.
"""
guid: str
words: List[str]
labels: Optional[List[str]]
@dataclass
class InputFeatures:
"""
A single set of features of data.
Property names are the same names as the corresponding inputs to a model.
"""
input_ids: List[int]
attention_mask: List[int]
token_type_ids: Optional[List[int]] = None
label_ids: Optional[List[int]] = None
class Split(Enum):
train = "train"
dev = "dev"
test = "test"
class NerDataset(Dataset):
"""
This will be superseded by a framework-agnostic approach
soon.
"""
features: List[InputFeatures]
pad_token_label_id: int = nn.CrossEntropyLoss().ignore_index
# Use cross entropy ignore_index as padding label id so that only
# real label ids contribute to the loss later.
def __init__(
self,
data_dir: str,
tokenizer: PreTrainedTokenizer,
labels: List[str],
model_type: str,
max_seq_length: Optional[int] = None,
overwrite_cache=False,
mode: Split = Split.train,
local_rank=-1,
):
# Load data features from cache or dataset file
cached_features_file = os.path.join(
data_dir, "cached_{}_{}_{}".format(mode.value, tokenizer.__class__.__name__, str(max_seq_length)),
)
with torch_distributed_zero_first(local_rank):
# Make sure only the first process in distributed training processes the dataset,
# and the others will use the cache.
if os.path.exists(cached_features_file) and not overwrite_cache:
logger.info(f"Loading features from cached file {cached_features_file}")
self.features = torch.load(cached_features_file)
else:
logger.info(f"Creating features from dataset file at {data_dir}")
examples = read_examples_from_file(data_dir, mode)
# TODO clean up all this to leverage built-in features of tokenizers
self.features = convert_examples_to_features(
examples,
labels,
max_seq_length,
tokenizer,
cls_token_at_end=bool(model_type in ["xlnet"]),
# xlnet has a cls token at the end
cls_token=tokenizer.cls_token,
cls_token_segment_id=2 if model_type in ["xlnet"] else 0,
sep_token=tokenizer.sep_token,
sep_token_extra=bool(model_type in ["roberta"]),
# roberta uses an extra separator b/w pairs of sentences, cf. github.com/pytorch/fairseq/commit/1684e166e3da03f5b600dbb7855cb98ddfcd0805
pad_on_left=bool(tokenizer.padding_side == "left"),
pad_token=tokenizer.pad_token_id,
pad_token_segment_id=tokenizer.pad_token_type_id,
pad_token_label_id=self.pad_token_label_id,
)
if local_rank in [-1, 0]:
logger.info(f"Saving features into cached file {cached_features_file}")
torch.save(self.features, cached_features_file)
def __len__(self):
return len(self.features)
def __getitem__(self, i) -> InputFeatures:
return self.features[i]
def read_examples_from_file(data_dir, mode: Union[Split, str]) -> List[InputExample]:
if isinstance(mode, Split):
mode = mode.value
file_path = os.path.join(data_dir, f"{mode}.txt")
guid_index = 1
examples = []
with open(file_path, encoding="utf-8") as f:
@@ -60,7 +146,7 @@ def read_examples_from_file(data_dir, mode):
for line in f:
if line.startswith("-DOCSTART-") or line == "" or line == "\n":
if words:
examples.append(InputExample(guid=f"{mode}-{guid_index}", words=words, labels=labels))
guid_index += 1
words = []
labels = []
@@ -73,15 +159,15 @@ def read_examples_from_file(data_dir, mode):
# Examples could have no label for mode = "test"
labels.append("O")
if words:
examples.append(InputExample(guid=f"{mode}-{guid_index}", words=words, labels=labels))
return examples
def convert_examples_to_features(
examples: List[InputExample],
label_list: List[str],
max_seq_length: int,
tokenizer: PreTrainedTokenizer,
cls_token_at_end=False,
cls_token="[CLS]",
cls_token_segment_id=1,
@@ -93,19 +179,20 @@ def convert_examples_to_features(
pad_token_label_id=-100,
sequence_a_segment_id=0,
mask_padding_with_zero=True,
) -> List[InputFeatures]:
""" Loads a data file into a list of `InputFeatures`
`cls_token_at_end` define the location of the CLS token:
- False (Default, BERT/XLM pattern): [CLS] + A + [SEP] + B + [SEP]
- True (XLNet/GPT pattern): A + [SEP] + B + [SEP] + [CLS]
`cls_token_segment_id` define the segment id associated to the CLS token (0 for BERT, 2 for XLNet)
"""
# TODO clean up all this to leverage built-in features of tokenizers
label_map = {label: i for i, label in enumerate(label_list)}
features = []
for (ex_index, example) in enumerate(examples):
if ex_index % 10_000 == 0:
logger.info("Writing example %d of %d", ex_index, len(examples))
tokens = []
@@ -120,7 +207,7 @@ def convert_examples_to_features(
label_ids.extend([label_map[label]] + [pad_token_label_id] * (len(word_tokens) - 1))
# Account for [CLS] and [SEP] with "- 2" and with "- 3" for RoBERTa.
special_tokens_count = tokenizer.num_special_tokens_to_add()
if len(tokens) > max_seq_length - special_tokens_count:
tokens = tokens[: (max_seq_length - special_tokens_count)]
label_ids = label_ids[: (max_seq_length - special_tokens_count)]
@@ -193,13 +280,18 @@ def convert_examples_to_features(
logger.info("segment_ids: %s", " ".join([str(x) for x in segment_ids]))
logger.info("label_ids: %s", " ".join([str(x) for x in label_ids]))
if "token_type_ids" not in tokenizer.model_input_names:
segment_ids = None
features.append(
InputFeatures(
input_ids=input_ids, attention_mask=input_mask, token_type_ids=segment_ids, label_ids=label_ids
)
)
return features
def get_labels(path: str) -> List[str]:
if path:
with open(path, "r") as f:
labels = f.read().splitlines()
...
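For reference, the NER scripts above expect CoNLL-style text: one `token label` pair per line, with blank lines separating sentences; that is what `read_examples_from_file` parses into the new `InputExample` dataclass. A small hedged sketch of that round trip, with made-up tokens and a temporary directory standing in for a real data dir.

```python
# Tiny illustration of the input format parsed by read_examples_from_file;
# the sentences and labels are made up.
import os
import tempfile

from utils_ner import Split, read_examples_from_file

conll = "Angela B-PER\nMerkel I-PER\nbesucht O\nParis B-LOC\n. O\n\nHallo O\nWelt O\n"
data_dir = tempfile.mkdtemp()
with open(os.path.join(data_dir, "dev.txt"), "w", encoding="utf-8") as f:
    f.write(conll)

examples = read_examples_from_file(data_dir, Split.dev)
for ex in examples:
    print(ex.guid, list(zip(ex.words, ex.labels)))
# -> dev-1 [('Angela', 'B-PER'), ('Merkel', 'I-PER'), ...]
# -> dev-2 [('Hallo', 'O'), ('Welt', 'O')]
```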
This diff is collapsed.
This diff is collapsed.
This diff is collapsed.
@@ -159,7 +159,7 @@ def main(args):
# If output_dir not provided, a folder will be generated in pwd
if not args.output_dir:
args.output_dir = os.path.join("./results", f"{args.task}_{time.strftime('%Y%m%d_%H%M%S')}",)
os.makedirs(args.output_dir)
model = SummarizationTrainer(args)
trainer = generic_train(model, args)
...
@@ -10,7 +10,6 @@ export PYTHONPATH="../../":"${PYTHONPATH}"
python finetune.py \
--data_dir=./cnn-dailymail/cnn_dm \
--model_type=bart \
--model_name_or_path=bart-large \
--learning_rate=3e-5 \
--train_batch_size=4 \
...
@@ -22,6 +22,7 @@ from unittest.mock import patch
import run_generation
import run_glue
import run_language_modeling
import run_squad
@@ -56,13 +57,38 @@ class ExamplesTests(unittest.TestCase):
"--warmup_steps=2",
"--overwrite_output_dir",
"--seed=42",
"--max_seq_length=128",
]
model_name = "--model_name_or_path=bert-base-uncased"
with patch.object(sys, "argv", testargs + [model_name]):
result = run_glue.main()
del result["loss"]
for value in result.values():
self.assertGreaterEqual(value, 0.75)
def test_run_language_modeling(self):
stream_handler = logging.StreamHandler(sys.stdout)
logger.addHandler(stream_handler)
testargs = """
run_language_modeling.py
--model_name_or_path distilroberta-base
--model_type roberta
--mlm
--line_by_line
--train_data_file ./tests/fixtures/sample_text.txt
--eval_data_file ./tests/fixtures/sample_text.txt
--output_dir ./tests/fixtures
--overwrite_output_dir
--do_train
--do_eval
--num_train_epochs=1
--no_cuda
""".split()
with patch.object(sys, "argv", testargs):
result = run_language_modeling.main()
self.assertLess(result["perplexity"], 35)
def test_run_squad(self):
stream_handler = logging.StreamHandler(sys.stdout)
logger.addHandler(stream_handler)
...
*.*
cache*
temp*
!*.txt
!*.tsv
!*.json
!.gitignore
\ No newline at end of file