Unverified Commit 0ecfd17f authored by Thomas Wolf, committed by GitHub

Merge pull request #987 from huggingface/generative-finetuning

Generative finetuning
parents 50792dbd 529a16de
@@ -129,4 +129,5 @@ proc_data
runs
examples/runs
# data
data
\ No newline at end of file
@@ -12,8 +12,8 @@ Examples
  - How to use gradient-accumulation, multi-gpu training, distributed training, optimize on CPU and 16-bits training to train Bert models
* - `Fine-tuning with BERT: running the examples <#fine-tuning-bert-examples>`_
  - Running the examples in `examples <https://github.com/huggingface/pytorch-pretrained-BERT/tree/master/examples>`_\ : ``extract_classif.py``\ , ``run_bert_classifier.py``\ , ``run_bert_squad.py`` and ``run_lm_finetuning.py``
* - `Fine-tuning with OpenAI GPT, Transformer-XL, GPT-2 as well as BERT and RoBERTa <#fine-tuning>`_
  - Running the examples in `examples <https://github.com/huggingface/pytorch-pretrained-BERT/tree/master/examples>`_\ : ``run_openai_gpt.py``\ , ``run_transfo_xl.py``\ , ``run_gpt2.py`` and ``run_lm_finetuning.py``
* - `Fine-tuning BERT-large on GPUs <#fine-tuning-bert-large>`_
  - How to fine-tune ``BERT large``
@@ -395,12 +395,13 @@ Thank to the work of @Rocketknight1 and @tholor there are now **several scripts*
OpenAI GPT, Transformer-XL and GPT-2: running the examples
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
We provide several example scripts for OpenAI GPT, Transformer-XL, OpenAI GPT-2, BERT and RoBERTa based on (and extended from) the respective original implementations:
* fine-tuning OpenAI GPT on the ROCStories dataset
* evaluating Transformer-XL on Wikitext 103
* unconditional and conditional generation from a pre-trained OpenAI GPT-2 model
* fine-tuning GPT/GPT-2 on a causal language modeling task and BERT/RoBERTa on a masked language modeling task
Fine-tuning OpenAI GPT on the ROCStories dataset
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
@@ -454,7 +455,47 @@ Unconditional generation:
python run_gpt2.py --unconditional
The same options as in the original scripts are provided; please refer to the code of the example and the original repository of OpenAI.
Causal LM fine-tuning on GPT/GPT-2, Masked LM fine-tuning on BERT/RoBERTa
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Before running the following examples, you should download the `WikiText-2 dataset <https://blog.einstein.ai/the-wikitext-long-term-dependency-language-modeling-dataset/>`__ and unpack it to some directory ``$WIKITEXT_2_DATASET``.
The following results were obtained using the ``raw`` WikiText-2 (no tokens were replaced before tokenization).

This example fine-tunes GPT-2 on the WikiText-2 dataset. The training loss is the causal language modeling (next-token prediction) loss; evaluation reports the corresponding perplexity.

.. code-block:: bash

    export WIKITEXT_2_DATASET=/path/to/wikitext_dataset

    python run_lm_finetuning.py \
        --output_dir=output \
        --model_type=gpt2 \
        --model_name_or_path=gpt2 \
        --do_train \
        --train_data_file=$WIKITEXT_2_DATASET/wiki.train.raw \
        --do_eval \
        --eval_data_file=$WIKITEXT_2_DATASET/wiki.test.raw

This takes about half an hour to train on a single K80 GPU and about one minute for the evaluation to run.
It reaches a perplexity of about 20 once fine-tuned on the dataset.
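
For reference, the reported perplexity is simply the exponential of the average causal language modeling (cross-entropy) loss over the evaluation set. Below is a minimal sketch of that relationship on a single sentence, assuming the ``pytorch_transformers`` package layout and that ``GPT2LMHeadModel`` returns the loss first when ``labels`` are passed; the actual evaluation loop lives in ``run_lm_finetuning.py``.

.. code-block:: python

    import math
    import torch
    from pytorch_transformers import GPT2LMHeadModel, GPT2Tokenizer

    tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
    model = GPT2LMHeadModel.from_pretrained("gpt2")
    model.eval()

    input_ids = torch.tensor([tokenizer.encode("The quick brown fox jumps over the lazy dog.")])

    with torch.no_grad():
        # With labels == input_ids the model shifts them internally and
        # returns the next-token cross-entropy loss as its first output.
        loss = model(input_ids, labels=input_ids)[0]

    print("perplexity: %.2f" % math.exp(loss.item()))
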
This example fine-tunes RoBERTa on the WikiText-2 dataset. The training loss is the masked language modeling loss; evaluation reports the corresponding (masked) perplexity.
The ``--mlm`` flag is necessary to fine-tune BERT/RoBERTa on masked language modeling.

.. code-block:: bash

    export WIKITEXT_2_DATASET=/path/to/wikitext_dataset

    python run_lm_finetuning.py \
        --output_dir=output \
        --model_type=roberta \
        --model_name_or_path=roberta-base \
        --do_train \
        --train_data_file=$WIKITEXT_2_DATASET/wiki.train.raw \
        --do_eval \
        --eval_data_file=$WIKITEXT_2_DATASET/wiki.test.raw \
        --mlm

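For context on what ``--mlm`` changes: instead of scoring every next token, the script corrupts a fraction of the input tokens and trains the model to recover them. The sketch below shows the standard BERT-style corruption recipe (select ~15% of positions; of those, 80% become the mask token, 10% a random token and 10% are left unchanged). It is a simplified illustration of the idea, not the script's exact code.

.. code-block:: python

    import torch

    def mask_tokens(inputs, mask_token_id, vocab_size, mlm_probability=0.15):
        """BERT-style masking: returns corrupted inputs and labels (-1 = ignored by the loss)."""
        labels = inputs.clone()
        # Pick ~15% of positions to predict; the loss ignores everything else.
        masked = torch.bernoulli(torch.full(labels.shape, mlm_probability)).bool()
        labels[~masked] = -1

        # 80% of the picked positions are replaced by the mask token.
        replaced = torch.bernoulli(torch.full(labels.shape, 0.8)).bool() & masked
        inputs[replaced] = mask_token_id

        # Half of the remainder (10% overall) gets a random token; the rest stays unchanged.
        randomized = torch.bernoulli(torch.full(labels.shape, 0.5)).bool() & masked & ~replaced
        inputs[randomized] = torch.randint(vocab_size, labels.shape)[randomized]

        return inputs, labels
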
.. _fine-tuning-BERT-large:
...
@@ -125,6 +125,9 @@ class BertTokenizer(PreTrainedTokenizer):
super(BertTokenizer, self).__init__(unk_token=unk_token, sep_token=sep_token,
pad_token=pad_token, cls_token=cls_token,
mask_token=mask_token, **kwargs)
self.max_len_single_sentence = self.max_len - 2 # take into account special tokens
self.max_len_sentences_pair = self.max_len - 3 # take into account special tokens
if not os.path.isfile(vocab_file):
raise ValueError(
"Can't find a vocabulary file at path '{}'. To load the vocabulary from a Google pretrained "
...
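For context on the ``- 2`` / ``- 3`` reserved above: BERT wraps a single sequence as ``[CLS] A [SEP]`` (two special tokens) and a pair as ``[CLS] A [SEP] B [SEP]`` (three), so that many positions must be subtracted from ``max_len`` when truncating user text. A quick check with placeholder ids (the real ids come from the vocabulary file):

.. code-block:: python

    CLS, SEP = 101, 102          # placeholder ids for illustration only
    sent_a, sent_b = [7, 8, 9], [10, 11]

    single = [CLS] + sent_a + [SEP]                 # 2 special tokens -> max_len - 2
    pair = [CLS] + sent_a + [SEP] + sent_b + [SEP]  # 3 special tokens -> max_len - 3

    assert len(single) == len(sent_a) + 2
    assert len(pair) == len(sent_a) + len(sent_b) + 3
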
@@ -108,6 +108,8 @@ class GPT2Tokenizer(PreTrainedTokenizer):
def __init__(self, vocab_file, merges_file, errors='replace', unk_token="<|endoftext|>",
bos_token="<|endoftext|>", eos_token="<|endoftext|>", **kwargs):
super(GPT2Tokenizer, self).__init__(bos_token=bos_token, eos_token=eos_token, unk_token=unk_token, **kwargs)
self.max_len_single_sentence = self.max_len # no default special tokens - you can update this value if you add special tokens
self.max_len_sentences_pair = self.max_len # no default special tokens - you can update this value if you add special tokens
self.encoder = json.load(open(vocab_file))
self.decoder = {v:k for k,v in self.encoder.items()}
...
@@ -87,6 +87,9 @@ class OpenAIGPTTokenizer(PreTrainedTokenizer):
def __init__(self, vocab_file, merges_file, unk_token="<unk>", **kwargs):
super(OpenAIGPTTokenizer, self).__init__(unk_token=unk_token, **kwargs)
self.max_len_single_sentence = self.max_len # no default special tokens - you can update this value if you add special tokens
self.max_len_sentences_pair = self.max_len # no default special tokens - you can update this value if you add special tokens
try:
import ftfy
from spacy.lang.en import English
...
@@ -77,6 +77,9 @@ class RobertaTokenizer(PreTrainedTokenizer):
sep_token=sep_token, cls_token=cls_token, pad_token=pad_token,
mask_token=mask_token, **kwargs)
self.max_len_single_sentence = self.max_len - 2 # take into account special tokens
self.max_len_sentences_pair = self.max_len - 4 # take into account special tokens
self.encoder = json.load(open(vocab_file, encoding="utf-8"))
self.decoder = {v: k for k, v in self.encoder.items()}
self.errors = errors # how to handle errors in decoding
...
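RoBERTa reserves ``- 2`` for a single sequence but ``- 4`` for a pair, because its pair format places two separator tokens between the segments: ``<s> A </s> </s> B </s>``. A small sanity check with placeholder ids:

.. code-block:: python

    BOS, EOS = 0, 2              # placeholder ids standing in for <s> and </s>
    seq_a, seq_b = [7, 8, 9], [10, 11]

    single = [BOS] + seq_a + [EOS]                    # 2 special tokens -> max_len - 2
    pair = [BOS] + seq_a + [EOS, EOS] + seq_b + [EOS] # 4 special tokens -> max_len - 4

    assert len(pair) - (len(seq_a) + len(seq_b)) == 4
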
@@ -73,6 +73,10 @@ class TransfoXLTokenizer(PreTrainedTokenizer):
super(TransfoXLTokenizer, self).__init__(unk_token=unk_token, eos_token=eos_token,
additional_special_tokens=additional_special_tokens,
**kwargs)
self.max_len_single_sentence = self.max_len # no default special tokens - you can update this value if you add special tokens
self.max_len_sentences_pair = self.max_len # no default special tokens - you can update this value if you add special tokens
if never_split is None:
never_split = self.all_special_tokens
if special is None:
...
@@ -166,6 +166,9 @@ class PreTrainedTokenizer(object):
self._additional_special_tokens = []
self.max_len = max_len if max_len is not None else int(1e12)
self.max_len_single_sentence = self.max_len
self.max_len_sentences_pair = self.max_len
self.added_tokens_encoder = {}
self.added_tokens_decoder = {}
@@ -590,10 +593,12 @@ class PreTrainedTokenizer(object):
return first_sentence_tokens, second_sentence_tokens
def add_special_tokens_single_sentence(self, token_ids):
logger.warning("This tokenizer does not make use of special tokens. The sequence has been returned with no modification.")
return token_ids
def add_special_tokens_sentences_pair(self, token_ids_0, token_ids_1):
logger.warning("This tokenizer does not make use of special tokens. The two sequences have been concatenated.")
return token_ids_0 + token_ids_1
def convert_ids_to_tokens(self, ids, skip_special_tokens=False):
""" Converts a single index or a sequence of indices (integers) in a token "
...
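The change above relaxes the base-class contract: a tokenizer that defines no special tokens now returns its input (with a warning) instead of raising ``NotImplementedError``, while tokenizers such as BERT or RoBERTa override these hooks to add their markers. A rough sketch of the contrast, written as standalone functions rather than the actual class hierarchy:

.. code-block:: python

    import logging

    logger = logging.getLogger(__name__)

    def base_add_special_tokens_single_sentence(token_ids):
        # New default: warn and pass the ids through untouched.
        logger.warning("This tokenizer does not make use of special tokens.")
        return token_ids

    def base_add_special_tokens_sentences_pair(token_ids_0, token_ids_1):
        # New default for pairs: plain concatenation.
        logger.warning("This tokenizer does not make use of special tokens.")
        return token_ids_0 + token_ids_1

    def bert_style_add_special_tokens_single_sentence(token_ids, cls_id, sep_id):
        # A BERT-like override wraps the sequence: [CLS] ids [SEP].
        return [cls_id] + token_ids + [sep_id]
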
@@ -122,6 +122,10 @@ class XLMTokenizer(PreTrainedTokenizer):
cls_token=cls_token, mask_token=mask_token,
additional_special_tokens=additional_special_tokens,
**kwargs)
self.max_len_single_sentence = self.max_len - 2 # take into account special tokens
self.max_len_sentences_pair = self.max_len - 3 # take into account special tokens
try:
import ftfy
from spacy.lang.en import English
...
@@ -71,6 +71,10 @@ class XLNetTokenizer(PreTrainedTokenizer):
pad_token=pad_token, cls_token=cls_token,
mask_token=mask_token, additional_special_tokens=
additional_special_tokens, **kwargs)
self.max_len_single_sentence = self.max_len - 2 # take into account special tokens
self.max_len_sentences_pair = self.max_len - 3 # take into account special tokens
try:
import sentencepiece as spm
except ImportError:
...