@@ -131,6 +131,7 @@ This package comprises the following classes that can be imported in Python and
- Configuration classes for BERT, OpenAI GPT, GPT-2 and Transformer-XL (in the respective [`modeling.py`](./pytorch_pretrained_bert/modeling.py), [`modeling_openai.py`](./pytorch_pretrained_bert/modeling_openai.py), [`modeling_gpt2.py`](./pytorch_pretrained_bert/modeling_gpt2.py), [`modeling_transfo_xl.py`](./pytorch_pretrained_bert/modeling_transfo_xl.py) files):
- `BertConfig` - Configuration class to store the configuration of a `BertModel` with utilities to read and write from JSON configuration files.
- `OpenAIGPTConfig` - Configuration class to store the configuration of an `OpenAIGPTModel` with utilities to read and write from JSON configuration files.
- `GPT2Config` - Configuration class to store the configuration of a `GPT2Model` with utilities to read and write from JSON configuration files.
- `TransfoXLConfig` - Configuration class to store the configuration of a `TransfoXLModel` with utilities to read and write from JSON configuration files.
The repository further comprises:
...
...
@@ -461,10 +462,12 @@ Here is a detailed documentation of the classes in the package and how to use th
| Sub-section | Description |
|-|-|
| [Loading pre-trained weights](#loading-google-ai-or-openai-pre-trained-weights-or-pytorch-dump) | How to load Google AI/OpenAI's pre-trained weights or a PyTorch saved instance |
| [Serialization best-practices](#serialization-best-practices) | How to save and reload a fine-tuned model |
| [Configurations](#configurations) | API of the configuration classes for BERT, GPT, GPT-2 and Transformer-XL |
| [Models](#models) | API of the PyTorch model classes for BERT, GPT, GPT-2 and Transformer-XL |
| [Tokenizers](#tokenizers) | API of the tokenizer classes for BERT, GPT, GPT-2 and Transformer-XL |
| [Optimizers](#optimizers) | API of the optimizers |
### Loading Google AI or OpenAI pre-trained weights or PyTorch dump
...
...
@@ -524,7 +527,101 @@ model = GPT2Model.from_pretrained('gpt2')
```
### Serialization best-practices
This section explains how you can save and re-load a fine-tuned model (BERT, GPT, GPT-2 and Transformer-XL).
There are three types of files you need to save to be able to reload a fine-tuned model:
- the model itself, which should be saved following PyTorch serialization [best practices](https://pytorch.org/docs/stable/notes/serialization.html#best-practices),
- the configuration file of the model which is saved as a JSON file, and
- the vocabulary (and the merges for the BPE-based models GPT and GPT-2).
Here is the recommended way of saving the model, configuration and vocabulary to an `output_dir` directory and reloading the model and tokenizer afterwards:
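A minimal sketch of this workflow (it assumes `model` holds your fine-tuned BERT model and `tokenizer` its tokenizer; `WEIGHTS_NAME` and `CONFIG_NAME` are the package's standard file names that `from_pretrained` looks for):
```python
import os
import torch

from pytorch_pretrained_bert import WEIGHTS_NAME, CONFIG_NAME, BertForSequenceClassification, BertTokenizer

output_dir = './models/'
os.makedirs(output_dir, exist_ok=True)

# If the model is wrapped in DataParallel/DistributedDataParallel,
# save only the encapsulated module
model_to_save = model.module if hasattr(model, 'module') else model

# Saving under the standard file names lets `from_pretrained` find them
torch.save(model_to_save.state_dict(), os.path.join(output_dir, WEIGHTS_NAME))
model_to_save.config.to_json_file(os.path.join(output_dir, CONFIG_NAME))
tokenizer.save_vocabulary(output_dir)

# Re-load the fine-tuned model and tokenizer (example for a BERT classifier)
model = BertForSequenceClassification.from_pretrained(output_dir)
tokenizer = BertTokenizer.from_pretrained(output_dir)
```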
### Configurations
Models (BERT, GPT, GPT-2 and Transformer-XL) are defined and built from configuration classes which contain the parameters of the models (number of layers, dimensionalities...) and a few utilities to read and write from JSON configuration files. The respective configuration classes are:
- `BertConfig` for `BertModel` and BERT class instances.
- `OpenAIGPTConfig` for `OpenAIGPTModel` and OpenAI GPT class instances.
- `GPT2Config` for `GPT2Model` and OpenAI GPT-2 class instances.
- `TransfoXLConfig` for `TransfoXLModel` and Transformer-XL class instances.
These configuration classes contain a few utilities to load and save configurations:
- `from_dict(cls, json_object)`: A class method to construct a configuration from a Python dictionary of parameters. Returns an instance of the configuration class.
- `from_json_file(cls, json_file)`: A class method to construct a configuration from a JSON file of parameters. Returns an instance of the configuration class.
- `to_dict()`: Serializes an instance to a Python dictionary. Returns a dictionary.
- `to_json_string()`: Serializes an instance to a JSON string. Returns a string.
- `to_json_file(json_file_path)`: Saves an instance to a JSON file.
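As an illustration, here is a small sketch of a configuration round-trip (the hyper-parameter values below are illustrative, not the pre-trained defaults):
```python
from pytorch_pretrained_bert import BertConfig

# Build a configuration from scratch (illustrative values)
config = BertConfig(vocab_size_or_config_json_file=32000,
                    hidden_size=768,
                    num_hidden_layers=12,
                    num_attention_heads=12,
                    intermediate_size=3072)

# Round-trip through a JSON file
config.to_json_file('bert_config.json')
config = BertConfig.from_json_file('bert_config.json')
print(config.to_json_string())
```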
### Models
#### 1. `BertModel`
...
...
@@ -796,8 +893,7 @@ This model *outputs*:
- `multiple_choice_logits`: the multiple choice logits as a torch.FloatTensor of size [batch_size, num_choices]
- `presents`: a list of pre-computed hidden-states (keys and values in each attention block) as torch.FloatTensors. They can be reused to speed up sequential decoding (see the `run_gpt2.py` example).
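To give an idea of how `presents` can be reused, here is a sketch of greedy decoding that feeds them back through the `past` argument so previous tokens are not re-encoded at each step (it uses `GPT2LMHeadModel` for simplicity; the mechanism is the same for the other GPT-2 model classes):
```python
import torch
from pytorch_pretrained_bert import GPT2Tokenizer, GPT2LMHeadModel

tokenizer = GPT2Tokenizer.from_pretrained('gpt2')
model = GPT2LMHeadModel.from_pretrained('gpt2')
model.eval()

generated = torch.tensor([tokenizer.encode("The Manhattan bridge")])
past = None
with torch.no_grad():
    for _ in range(20):
        # After the first step, only the newest token needs to be fed in
        input_ids = generated if past is None else generated[:, -1:]
        logits, past = model(input_ids, past=past)
        next_token = logits[:, -1, :].argmax(dim=-1, keepdim=True)
        generated = torch.cat([generated, next_token], dim=1)
print(tokenizer.decode(generated[0].tolist()))
```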
### Tokenizers
#### `BertTokenizer`
...
...
@@ -816,6 +912,7 @@ and three methods:
- `tokenize(text)`: convert a `str` into a list of `str` tokens by (1) performing basic tokenization and (2) WordPiece tokenization.
- `convert_tokens_to_ids(tokens)`: convert a list of `str` tokens into a list of `int` indices in the vocabulary.
- `convert_ids_to_tokens(tokens)`: convert a list of `int` indices into a list of `str` tokens using the vocabulary.
- `save_vocabulary(directory_path)`: save the vocabulary file to `directory_path`. Returns the path to the saved vocabulary file: `vocab_file_path`. The vocabulary can be reloaded with `BertTokenizer.from_pretrained('vocab_file_path')` or `BertTokenizer.from_pretrained('directory_path')`.
Please refer to the doc strings and code in [`tokenization.py`](./pytorch_pretrained_bert/tokenization.py) for the details of the `BasicTokenizer` and `WordpieceTokenizer` classes. In general it is recommended to use `BertTokenizer` unless you know what you are doing.
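For reference, a typical round-trip with `BertTokenizer` looks like this (a minimal sketch; the `./bert_vocab/` directory is an arbitrary choice):
```python
import os
from pytorch_pretrained_bert import BertTokenizer

tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')

tokens = tokenizer.tokenize("[CLS] who was jim henson ? [SEP]")
ids = tokenizer.convert_tokens_to_ids(tokens)
assert tokenizer.convert_ids_to_tokens(ids) == tokens

# Save the vocabulary, then reload the tokenizer from the saved file
os.makedirs('./bert_vocab/', exist_ok=True)
tokenizer.save_vocabulary('./bert_vocab/')
tokenizer = BertTokenizer.from_pretrained('./bert_vocab/')
```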
...
...
@@ -832,11 +929,13 @@ This class has four arguments:
and seven methods:
- `tokenize(text)`: convert a `str` into a list of `str` tokens by performing BPE tokenization.
- `convert_tokens_to_ids(tokens)`: convert a list of `str` tokens into a list of `int` indices in the vocabulary.
- `convert_ids_to_tokens(tokens)`: convert a list of `int` indices into a list of `str` tokens using the vocabulary.
- `set_special_tokens(self, special_tokens)`: update the list of special tokens (see the arguments above).
- `encode(text)`: convert a `str` into a list of `int` indices by performing BPE encoding.
- `decode(ids, skip_special_tokens=False, clean_up_tokenization_spaces=False)`: decode a list of `int` indices into a string, with optional post-processing: (i) remove special tokens from the output and (ii) clean up tokenization spaces.
- `save_vocabulary(directory_path)`: save the vocabulary, merges and special tokens files to `directory_path`. Returns the paths to the three files: `vocab_file_path`, `merge_file_path`, `special_tokens_file_path`. The vocabulary can be reloaded with `OpenAIGPTTokenizer.from_pretrained('directory_path')`.
Please refer to the doc strings and code in [`tokenization_openai.py`](./pytorch_pretrained_bert/tokenization_openai.py) for the details of the `OpenAIGPTTokenizer`.
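A short sketch of how `encode` relates to the lower-level methods:
```python
from pytorch_pretrained_bert import OpenAIGPTTokenizer

tokenizer = OpenAIGPTTokenizer.from_pretrained('openai-gpt')

# encode() is tokenize() followed by convert_tokens_to_ids()
text = "who was jim henson ?"
ids = tokenizer.encode(text)
assert ids == tokenizer.convert_tokens_to_ids(tokenizer.tokenize(text))

print(tokenizer.decode(ids, clean_up_tokenization_spaces=True))
```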
...
...
@@ -844,6 +943,8 @@ Please refer to the doc strings and code in [`tokenization_openai.py`](./pytorch
`TransfoXLTokenizer` performs word-level tokenization. This tokenizer can be used for adaptive softmax and has utilities for counting tokens in a corpus to create a vocabulary ordered by token frequency. See the adaptive softmax paper ([Efficient softmax approximation for GPUs](http://arxiv.org/abs/1609.04309)) for more details.
The API is similar to the API of `BertTokenizer` (see above).
Please refer to the doc strings and code in [`tokenization_transfo_xl.py`](./pytorch_pretrained_bert/tokenization_transfo_xl.py) for the details of these additional methods in `TransfoXLTokenizer`.
#### `GPT2Tokenizer`
...
...
@@ -858,13 +959,17 @@ This class has three arguments:
and seven methods:
- `tokenize(text)`: convert a `str` into a list of `str` tokens by performing byte-level BPE.
- `convert_tokens_to_ids(tokens)`: convert a list of `str` tokens into a list of `int` indices in the vocabulary.
- `convert_ids_to_tokens(tokens)`: convert a list of `int` indices into a list of `str` tokens using the vocabulary.
- `set_special_tokens(self, special_tokens)`: update the list of special tokens (see the arguments above).
- `encode(text)`: convert a `str` into a list of `int` indices by performing byte-level BPE.
- `decode(tokens)`: convert a list of `int` indices back into a `str`.
- `save_vocabulary(directory_path)`: save the vocabulary, merges and special tokens files to `directory_path`. Returns the paths to the three files: `vocab_file_path`, `merge_file_path`, `special_tokens_file_path`. The vocabulary can be reloaded with `GPT2Tokenizer.from_pretrained('directory_path')`.
Please refer to [`tokenization_gpt2.py`](./pytorch_pretrained_bert/tokenization_gpt2.py) for more details on the `GPT2Tokenizer`.
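Byte-level BPE round-trips arbitrary UTF-8 text exactly, which the following sketch illustrates (the `./gpt2_vocab/` directory is an arbitrary choice):
```python
import os
from pytorch_pretrained_bert import GPT2Tokenizer

tokenizer = GPT2Tokenizer.from_pretrained('gpt2')

# decode(encode(text)) recovers the input, including accents and punctuation
text = "Byte-level BPE handles any UTF-8 text: café, naïve, 100%."
assert tokenizer.decode(tokenizer.encode(text)) == text

# Save the vocabulary, merges and special tokens files, then reload
os.makedirs('./gpt2_vocab/', exist_ok=True)
tokenizer.save_vocabulary('./gpt2_vocab/')
tokenizer = GPT2Tokenizer.from_pretrained('./gpt2_vocab/')
```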
### Optimizers
#### `BertAdam`
...
...
@@ -1174,18 +1279,20 @@ To get these results we used a combination of:
Here is the full list of hyper-parameters for this run:
```bash
export SQUAD_DIR=/path/to/SQUAD
python ./run_squad.py \
--bert_model bert-large-uncased \
--do_train \
--do_predict \
--do_lower_case \
--train_file $SQUAD_DIR/train-v1.1.json \
--predict_file $SQUAD_DIR/dev-v1.1.json \
--learning_rate 3e-5 \
--num_train_epochs 2 \
--max_seq_length 384 \
--doc_stride 128 \
--output_dir /tmp/debug_squad/ \
--train_batch_size 24 \
--gradient_accumulation_steps 2
```
...
...
@@ -1194,18 +1301,20 @@ If you have a recent GPU (starting from NVIDIA Volta series), you should try **1
Here is an example of hyper-parameters for a FP16 run we tried:
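A sketch along these lines (assuming the same SQuAD setup as above and the `--fp16`/`--loss_scale` flags exposed by `run_squad.py`; the batch size and loss scale values are illustrative):
```bash
export SQUAD_DIR=/path/to/SQUAD
python ./run_squad.py \
--bert_model bert-large-uncased \
--do_train \
--do_predict \
--do_lower_case \
--train_file $SQUAD_DIR/train-v1.1.json \
--predict_file $SQUAD_DIR/dev-v1.1.json \
--learning_rate 3e-5 \
--num_train_epochs 2 \
--max_seq_length 384 \
--doc_stride 128 \
--output_dir /tmp/debug_squad/ \
--train_batch_size 24 \
--fp16 \
--loss_scale 128
```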
@@ -930,7 +937,7 @@ class BertForSequenceClassification(BertPreTrainedModel):
Inputs:
`input_ids`: a torch.LongTensor of shape [batch_size, sequence_length]
with the word token indices in the vocabulary. Items in the batch should begin with the special "[CLS]" token. (see the tokens preprocessing logic in the scripts
`extract_features.py`, `run_classifier.py` and `run_squad.py`)
`token_type_ids`: an optional torch.LongTensor of shape [batch_size, sequence_length] with the token
types indices selected in [0, 1]. Type 0 corresponds to a `sentence A` and type 1 corresponds to