Unverified Commit 0ecfd17f authored by Thomas Wolf's avatar Thomas Wolf Committed by GitHub
Browse files

Merge pull request #987 from huggingface/generative-finetuning

Generative finetuning
parents 50792dbd 529a16de
......@@ -129,4 +129,5 @@ proc_data
runs
examples/runs
# data
data
\ No newline at end of file
......@@ -12,8 +12,8 @@ Examples
- How to use gradient-accumulation, multi-gpu training, distributed training, optimize on CPU and 16-bits training to train Bert models
* - `Fine-tuning with BERT: running the examples <#fine-tuning-bert-examples>`_
- Running the examples in `examples <https://github.com/huggingface/pytorch-pretrained-BERT/tree/master/examples>`_\ : ``extract_classif.py``\ , ``run_bert_classifier.py``\ , ``run_bert_squad.py`` and ``run_lm_finetuning.py``
* - `Fine-tuning with OpenAI GPT, Transformer-XL and GPT-2 <#fine-tuning>`_
- Running the examples in `examples <https://github.com/huggingface/pytorch-pretrained-BERT/tree/master/examples>`_\ : ``run_openai_gpt.py``\ , ``run_transfo_xl.py`` and ``run_gpt2.py``
* - `Fine-tuning with OpenAI GPT, Transformer-XL, GPT-2 as well as BERT and RoBERTa <#fine-tuning>`_
- Running the examples in `examples <https://github.com/huggingface/pytorch-pretrained-BERT/tree/master/examples>`_\ : ``run_openai_gpt.py``\ , ``run_transfo_xl.py``, ``run_gpt2.py`` and ``run_lm_finetuning.py``
* - `Fine-tuning BERT-large on GPUs <#fine-tuning-bert-large>`_
- How to fine tune ``BERT large``
......@@ -395,12 +395,13 @@ Thank to the work of @Rocketknight1 and @tholor there are now **several scripts*
OpenAI GPT, Transformer-XL and GPT-2: running the examples
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
We provide three examples of scripts for OpenAI GPT, Transformer-XL and OpenAI GPT-2 based on (and extended from) the respective original implementations:
We provide three examples of scripts for OpenAI GPT, Transformer-XL, OpenAI GPT-2, BERT and RoBERTa based on (and extended from) the respective original implementations:
* fine-tuning OpenAI GPT on the ROCStories dataset
* evaluating Transformer-XL on Wikitext 103
* unconditional and conditional generation from a pre-trained OpenAI GPT-2 model
* fine-tuning GPT/GPT-2 on a causal language modeling task and BERT/RoBERTa on a masked language modeling task
Fine-tuning OpenAI GPT on the RocStories dataset
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
......@@ -454,7 +455,47 @@ Unconditional generation:
python run_gpt2.py --unconditional
The same option as in the original scripts are provided, please refere to the code of the example and the original repository of OpenAI.
The same option as in the original scripts are provided, please refer to the code of the example and the original repository of OpenAI.
Causal LM fine-tuning on GPT/GPT-2, Masked LM fine-tuning on BERT/RoBERTa
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Before running the following examples you should download the `WikiText-2 dataset <https://blog.einstein.ai/the-wikitext-long-term-dependency-language-modeling-dataset/>`__ and unpack it to some directory `$WIKITEXT_2_DATASET`
The following results were obtained using the `raw` WikiText-2 (no tokens were replaced before the tokenization).
This example fine-tunes GPT-2 on the WikiText-2 dataset. The loss function is a causal language modeling loss (perplexity).
.. code-block:: bash
export WIKITEXT_2_DATASET=/path/to/wikitext_dataset
python run_lm_finetuning.py
--output_dir=output
--model_type=gpt2
--model_name_or_path=gpt2
--do_train
--train_data_file=$WIKITEXT_2_DATASET/wiki.train.raw
--do_eval
--eval_data_file=$WIKITEXT_2_DATASET/wiki.test.raw
This takes about half an hour to train on a single K80 GPU and about one minute for the evaluation to run.
It reaches a score of about 20 perplexity once fine-tuned on the dataset.
This example fine-tunes RoBERTa on the WikiText-2 dataset. The loss function is a masked language modeling loss (masked perplexity).
The `--mlm` flag is necessary to fine-tune BERT/RoBERTa on masked language modeling.
.. code-block:: bash
export WIKITEXT_2_DATASET=/path/to/wikitext_dataset
python run_lm_finetuning.py
--output_dir=output
--model_type=roberta
--model_name_or_path=roberta-base
--do_train
--train_data_file=$WIKITEXT_2_DATASET/wiki.train.raw
--do_eval
--eval_data_file=$WIKITEXT_2_DATASET/wiki.test.raw
--mlm
.. _fine-tuning-BERT-large:
......
This diff is collapsed.
......@@ -125,6 +125,9 @@ class BertTokenizer(PreTrainedTokenizer):
super(BertTokenizer, self).__init__(unk_token=unk_token, sep_token=sep_token,
pad_token=pad_token, cls_token=cls_token,
mask_token=mask_token, **kwargs)
self.max_len_single_sentence = self.max_len - 2 # take into account special tokens
self.max_len_sentences_pair = self.max_len - 3 # take into account special tokens
if not os.path.isfile(vocab_file):
raise ValueError(
"Can't find a vocabulary file at path '{}'. To load the vocabulary from a Google pretrained "
......
......@@ -108,6 +108,8 @@ class GPT2Tokenizer(PreTrainedTokenizer):
def __init__(self, vocab_file, merges_file, errors='replace', unk_token="<|endoftext|>",
bos_token="<|endoftext|>", eos_token="<|endoftext|>", **kwargs):
super(GPT2Tokenizer, self).__init__(bos_token=bos_token, eos_token=eos_token, unk_token=unk_token, **kwargs)
self.max_len_single_sentence = self.max_len # no default special tokens - you can update this value if you add special tokens
self.max_len_sentences_pair = self.max_len # no default special tokens - you can update this value if you add special tokens
self.encoder = json.load(open(vocab_file))
self.decoder = {v:k for k,v in self.encoder.items()}
......
......@@ -87,6 +87,9 @@ class OpenAIGPTTokenizer(PreTrainedTokenizer):
def __init__(self, vocab_file, merges_file, unk_token="<unk>", **kwargs):
super(OpenAIGPTTokenizer, self).__init__(unk_token=unk_token, **kwargs)
self.max_len_single_sentence = self.max_len # no default special tokens - you can update this value if you add special tokens
self.max_len_sentences_pair = self.max_len # no default special tokens - you can update this value if you add special tokens
try:
import ftfy
from spacy.lang.en import English
......
......@@ -77,6 +77,9 @@ class RobertaTokenizer(PreTrainedTokenizer):
sep_token=sep_token, cls_token=cls_token, pad_token=pad_token,
mask_token=mask_token, **kwargs)
self.max_len_single_sentence = self.max_len - 2 # take into account special tokens
self.max_len_sentences_pair = self.max_len - 4 # take into account special tokens
self.encoder = json.load(open(vocab_file, encoding="utf-8"))
self.decoder = {v: k for k, v in self.encoder.items()}
self.errors = errors # how to handle errors in decoding
......
......@@ -73,6 +73,10 @@ class TransfoXLTokenizer(PreTrainedTokenizer):
super(TransfoXLTokenizer, self).__init__(unk_token=unk_token, eos_token=eos_token,
additional_special_tokens=additional_special_tokens,
**kwargs)
self.max_len_single_sentence = self.max_len # no default special tokens - you can update this value if you add special tokens
self.max_len_sentences_pair = self.max_len # no default special tokens - you can update this value if you add special tokens
if never_split is None:
never_split = self.all_special_tokens
if special is None:
......
......@@ -166,6 +166,9 @@ class PreTrainedTokenizer(object):
self._additional_special_tokens = []
self.max_len = max_len if max_len is not None else int(1e12)
self.max_len_single_sentence = self.max_len
self.max_len_sentences_pair = self.max_len
self.added_tokens_encoder = {}
self.added_tokens_decoder = {}
......@@ -590,10 +593,12 @@ class PreTrainedTokenizer(object):
return first_sentence_tokens, second_sentence_tokens
def add_special_tokens_single_sentence(self, token_ids):
raise NotImplementedError
logger.warning("This tokenizer does not make use of special tokens. The sequence has been returned with no modification.")
return token_ids
def add_special_tokens_sentences_pair(self, token_ids_0, token_ids_1):
raise NotImplementedError
logger.warning("This tokenizer does not make use of special tokens. The two sequences have been concatenated.")
return token_ids_0 + token_ids_1
def convert_ids_to_tokens(self, ids, skip_special_tokens=False):
""" Converts a single index or a sequence of indices (integers) in a token "
......
......@@ -122,6 +122,10 @@ class XLMTokenizer(PreTrainedTokenizer):
cls_token=cls_token, mask_token=mask_token,
additional_special_tokens=additional_special_tokens,
**kwargs)
self.max_len_single_sentence = self.max_len - 2 # take into account special tokens
self.max_len_sentences_pair = self.max_len - 3 # take into account special tokens
try:
import ftfy
from spacy.lang.en import English
......
......@@ -71,6 +71,10 @@ class XLNetTokenizer(PreTrainedTokenizer):
pad_token=pad_token, cls_token=cls_token,
mask_token=mask_token, additional_special_tokens=
additional_special_tokens, **kwargs)
self.max_len_single_sentence = self.max_len - 2 # take into account special tokens
self.max_len_sentences_pair = self.max_len - 3 # take into account special tokens
try:
import sentencepiece as spm
except ImportError:
......
Markdown is supported
0% or .
You are about to add 0 people to the discussion. Proceed with caution.
Finish editing this message first!
Please register or to comment