@@ -131,11 +131,8 @@ This package comprises the following classes that can be imported in Python and
- Tokenizer for **OpenAI GPT-2** (using byte-level Byte-Pair-Encoding) (in the [`tokenization_gpt2.py`](./pytorch_transformers/tokenization_gpt2.py) file):
- Optimizer (in the [`optimization.py`](./pytorch_transformers/optimization.py) file):
  - `AdamW` - Version of the Adam algorithm with weight decay fix, warmup and linear decay of the learning rate.
- Configuration classes for BERT, OpenAI GPT and Transformer-XL (in the respective [`modeling.py`](./pytorch_transformers/modeling.py), [`modeling_openai.py`](./pytorch_transformers/modeling_openai.py), [`modeling_transfo_xl.py`](./pytorch_transformers/modeling_transfo_xl.py) files):
  - `BertConfig` - Configuration class to store the configuration of a `BertModel`, with utilities to read and write from JSON configuration files.
...
...
@@ -1104,12 +1101,11 @@ Please refer to [`tokenization_gpt2.py`](./pytorch_transformers/tokenization_gpt
### Optimizers
#### `AdamW`

`AdamW` is a `torch.optimizer` adapted to be closer to the optimizer used in the TensorFlow implementation of Bert. The main difference with the PyTorch Adam optimizer is that AdamW implements the weight decay fix.
The optimizer accepts the following arguments:
...
...
@@ -1127,13 +1123,6 @@ The optimizer accepts the following arguments:
- `weight_decay` : Weight decay. Default : `0.01`
- `max_grad_norm` : Maximum norm for the gradients (`-1` means no clipping). Default : `1.0`
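As an illustration, here is a minimal, hedged sketch of constructing the optimizer with the arguments listed above. `model` is assumed to already exist in scope; the split into two parameter groups (no weight decay on biases and LayerNorm weights) is a common convention, not something the optimizer requires.

```python
from pytorch_transformers import AdamW   # AdamW lives in optimization.py

# Apply weight decay to everything except biases and LayerNorm weights.
no_decay = ['bias', 'LayerNorm.weight']
grouped_parameters = [
    {'params': [p for n, p in model.named_parameters() if not any(nd in n for nd in no_decay)],
     'weight_decay': 0.01},
    {'params': [p for n, p in model.named_parameters() if any(nd in n for nd in no_decay)],
     'weight_decay': 0.0},
]
optimizer = AdamW(grouped_parameters, lr=2e-5)
```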
#### Learning Rate Schedules
The `.optimization` module also provides additional schedules in the form of schedule objects that inherit from `_LRSchedule`.
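As a hedged companion example, the sketch below pairs `AdamW` with one of these schedules. It assumes the `WarmupLinearSchedule` class and the convention of stepping the schedule once per optimizer update; `model`, `train_batches`, `compute_loss` and the step counts are placeholders.

```python
from pytorch_transformers import AdamW, WarmupLinearSchedule

optimizer = AdamW(model.parameters(), lr=2e-5)
scheduler = WarmupLinearSchedule(optimizer, warmup_steps=100, t_total=1000)  # warmup then linear decay

for batch in train_batches:
    loss = compute_loss(model, batch)   # placeholder loss computation
    loss.backward()
    optimizer.step()
    scheduler.step()                    # advance the learning rate schedule
    optimizer.zero_grad()
```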
@@ -60,10 +60,10 @@ This PyTorch implementation of Transformer-XL is an adaptation of the original `
This PyTorch implementation of OpenAI GPT-2 is an adaptation of the `OpenAI's implementation <https://github.com/openai/gpt-2>`__ and is provided with `OpenAI's pre-trained model <https://github.com/openai/gpt-2>`__ and a command-line interface that was used to convert the TensorFlow checkpoint in PyTorch.
**Facebook Research's XLM** was released together with the paper `Cross-lingual Language Model Pretraining <https://arxiv.org/abs/1901.07291>`__ by Guillaume Lample and Alexis Conneau.
This PyTorch implementation of XLM is an adaptation of the original `PyTorch implementation <https://github.com/facebookresearch/XLM>`__.
**Google's XLNet** was released together with the paper `XLNet: Generalized Autoregressive Pretraining for Language Understanding <https://arxiv.org/abs/1906.08237>`__ by Zhilin Yang\*, Zihang Dai\*, Yiming Yang, Jaime Carbonell, Ruslan Salakhutdinov and Quoc V. Le.
This PyTorch implementation of XLNet is an adaptation of the `Tensorflow implementation <https://github.com/zihangdai/xlnet>`__.
Content
...
...
@@ -91,7 +91,7 @@ Content
* - `Migration <./migration.html>`__
- Migrating from ``pytorch_pretrained_BERT`` (v0.6) to ``pytorch_transformers`` (v1.0)
* - `Bertology <./bertology.html>`__
- Exploring the internals of the pretrained models.
* - `TorchScript <./torchscript.html>`__
- Convert a model to TorchScript for use in other programming languages
...
...
@@ -115,8 +115,6 @@ Content
* - `XLNet <./model_doc/xlnet.html>`__
- XLNet Models, Tokenizers and optimizers
Overview
--------
...
...
@@ -219,17 +217,10 @@ TODO Lysandre filled: I filled in XLM and XLNet. I didn't do the Tokenizers beca
*
  Optimizer (in the `optimization.py <./_modules/pytorch_transformers/optimization.html>`__ file):

  * ``AdamW`` - Version of the Adam algorithm with weight decay fix, warmup and linear decay of the learning rate.
# Log pruning statistics: parameter count, score and speed before/after pruning.
logger.info("Pruning: original num of params: %.2e, after pruning %.2e (%.1f percents)", original_num_params, pruned_num_params, pruned_num_params / original_num_params * 100)
logger.info("Pruning: score with masking: %f score with pruning: %f", score_masking, score_pruning)
logger.info("Pruning: speed ratio (new timing / original timing): %f percents", original_time / new_time * 100)


def main():
    # Command-line arguments for the head masking / head pruning experiments.
    parser = argparse.ArgumentParser()
    parser.add_argument('--model_name_or_path', type=str, default='bert-base-cased-finetuned-mrpc',
                        help='pretrained model name or path to local checkpoint')
    parser.add_argument("--task_name", type=str, default='mrpc', help="The name of the task to train.")
    parser.add_argument("--data_dir", type=str, required=True,
                        help="The input data dir. Should contain the .tsv files (or other data files) for the task.")
    parser.add_argument("--output_dir", type=str, required=True,
                        help="The output directory where the model predictions and checkpoints will be written.")
    parser.add_argument("--data_subset", type=int, default=-1,
                        help="If > 0: limit the data to a subset of data_subset instances.")
    parser.add_argument("--overwrite_output_dir", action='store_true',
                        help="Whether to overwrite data in output directory")
    parser.add_argument("--dont_normalize_importance_by_layer", action='store_true',
                        help="Don't normalize importance score by layers")
    parser.add_argument("--dont_normalize_global_importance", action='store_true',
                        help="Don't normalize all importance scores between 0 and 1")
    parser.add_argument("--try_masking", action='store_true',
                        help="Whether to try to mask heads until a threshold of accuracy.")
    parser.add_argument("--masking_threshold", default=0.9, type=float,
                        help="masking threshold in terms of metrics "
                             "(stop masking when metric < threshold * original metric value).")
    parser.add_argument("--masking_amount", default=0.1, type=float,
                        help="Amount of heads to mask at each masking step.")
    parser.add_argument("--metric_name", default="acc", type=str, help="Metric to use for head masking.")
    parser.add_argument("--max_seq_length", default=128, type=int,
                        help="The maximum total input sequence length after WordPiece tokenization. \n"
                             "Sequences longer than this will be truncated, and sequences shorter \n"
                             "than this will be padded.")
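# The pruning statistics logged above come from physically removing the lowest-importance
# attention heads before re-scoring and re-timing the model. A hedged sketch of that step,
# assuming a `prune_heads` method that takes a mapping of layer index -> head indices
# (the indices below are purely illustrative):
heads_to_prune = {0: [0, 2], 5: [1]}
model.prune_heads(heads_to_prune)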
Performs a model forward pass. **Invoked by calling the instantiated model directly, i.e. `model(...)`.**
Parameters:
`input_ids`: a ``torch.LongTensor`` of shape [batch_size, sequence_length]
with the word token indices in the vocabulary. Items in the batch should begin with the special "CLS" token. (see the tokens preprocessing logic in the scripts
`run_bert_extract_features.py`, `run_bert_classifier.py` and `run_bert_squad.py`)
`token_type_ids`: an optional ``torch.LongTensor`` of shape [batch_size, sequence_length] with the token
types indices selected in [0, 1]. Type 0 corresponds to a `sentence A` and type 1 corresponds to
a `sentence B` token (see BERT paper for more details).
`attention_mask`: an optional ``torch.LongTensor`` of shape [batch_size, sequence_length] with indices
selected in [0, 1]. It's a mask to be used if the input sequence length is smaller than the max
input sequence length in the current batch. It's the mask that we typically use for attention when
a batch has varying length sentences.
`labels`: labels for the classification output: ``torch.LongTensor`` of shape [batch_size]
with indices selected in [0, ..., num_labels - 1].
`head_mask`: an optional ``torch.Tensor`` of shape [num_heads] or [num_layers, num_heads] with values between 0 and 1.
It's a mask used to nullify selected heads of the transformer: 1.0 => head is not masked, 0.0 => head is masked (nullified).
Returns:
If ``labels`` is not ``None``, outputs the CrossEntropy classification loss of the output with the labels.
If ``labels`` is ``None``, outputs the classification logits of shape [batch_size, num_labels].
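To make the call pattern concrete, here is a hedged sketch of the forward pass described above. `model` is assumed to be an already-instantiated sequence classification model from this package (e.g. `BertForSequenceClassification`), the token ids are placeholders, and the return convention follows the description above (a loss when `labels` is provided, logits otherwise).

```python
import torch

# Placeholder inputs of shape [batch_size, sequence_length]; the sequence starts with the "CLS" token.
input_ids = torch.tensor([[101, 7592, 2088, 102]])
attention_mask = torch.tensor([[1, 1, 1, 1]])   # 1 for real tokens, 0 for padding
labels = torch.tensor([1])                      # [batch_size] class indices

loss = model(input_ids, attention_mask=attention_mask, labels=labels)   # training: classification loss
logits = model(input_ids, attention_mask=attention_mask)                # inference: [batch_size, num_labels] logits
```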
@add_start_docstrings("""Bert Model with a span classification head on top for extractive question-answering tasks like SQuAD (linear layers on top of
the hidden-states output to compute `span start logits` and `span end logits`). """,