"vscode:/vscode.git/clone" did not exist on "142b69f24b57e5d358edeaa1569955f7684fee93"
Commit 2397f958 authored by thomwolf's avatar thomwolf
Browse files

updating examples and doc

parent c490f5ce
@@ -131,11 +131,8 @@ This package comprises the following classes that can be imported in Python and
- Tokenizer for **OpenAI GPT-2** (using byte-level Byte-Pair-Encoding) (in the [`tokenization_gpt2.py`](./pytorch_transformers/tokenization_gpt2.py) file):
- `GPT2Tokenizer` - perform byte-level Byte-Pair-Encoding (BPE) tokenization.
- Optimizer for **BERT** (in the [`optimization.py`](./pytorch_transformers/optimization.py) file):
- Optimizer (in the [`optimization.py`](./pytorch_transformers/optimization.py) file):
- `BertAdam` - Bert version of Adam algorithm with weight decay fix, warmup and linear decay of the learning rate.
- `AdamW` - Version of Adam algorithm with weight decay fix, warmup and linear decay of the learning rate.
- Optimizer for **OpenAI GPT** (in the [`optimization_openai.py`](./pytorch_transformers/optimization_openai.py) file):
- `OpenAIAdam` - OpenAI GPT version of Adam algorithm with weight decay fix, warmup and linear decay of the learning rate.
- Configuration classes for BERT, OpenAI GPT and Transformer-XL (in the respective [`modeling.py`](./pytorch_transformers/modeling.py), [`modeling_openai.py`](./pytorch_transformers/modeling_openai.py), [`modeling_transfo_xl.py`](./pytorch_transformers/modeling_transfo_xl.py) files):
- `BertConfig` - Configuration class to store the configuration of a `BertModel` with utilities to read and write from JSON configuration files.
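For illustration, a minimal sketch of importing and instantiating a few of these classes; the shortcut names `bert-base-uncased` and `gpt2` are assumed to be among the available pre-trained shortcuts:

```python
# Minimal sketch: load a configuration and the two kinds of tokenizers mentioned above.
from pytorch_transformers import BertConfig, BertTokenizer, GPT2Tokenizer

config = BertConfig.from_pretrained('bert-base-uncased')             # model hyper-parameters
bert_tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')  # WordPiece tokenizer
gpt2_tokenizer = GPT2Tokenizer.from_pretrained('gpt2')               # byte-level BPE tokenizer

tokens = bert_tokenizer.tokenize("Hello, my dog is cute")
input_ids = bert_tokenizer.convert_tokens_to_ids(tokens)
```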
@@ -1104,12 +1101,11 @@ Please refer to [`tokenization_gpt2.py`](./pytorch_transformers/tokenization_gpt
### Optimizers
#### `BertAdam`
#### `AdamW`
`BertAdam` is a `torch.optimizer` adapted to be closer to the optimizer used in the TensorFlow implementation of BERT. The differences with the PyTorch Adam optimizer are the following:
`AdamW` is a `torch.optimizer` adapted to be closer to the optimizer used in the TensorFlow implementation of BERT. The differences with the PyTorch Adam optimizer are the following:
- BertAdam implements weight decay fix,
- AdamW implements weight decay fix,
- BertAdam doesn't compensate for bias as in the regular Adam optimizer.
The optimizer accepts the following arguments:
@@ -1127,13 +1123,6 @@ The optimizer accepts the following arguments:
- `weight_decay:` Weight decay. Default : `0.01`
- `max_grad_norm` : Maximum norm for the gradients (`-1` means no clipping). Default : `1.0`
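For illustration, a minimal sketch of constructing the optimizer with these arguments, assuming a `model` and a step count `num_train_optimization_steps` are defined as in the training examples (the parameter grouping and hyper-parameter values are assumptions, not requirements):

```python
# Minimal sketch: AdamW with warmup, linear decay and the weight decay fix.
from pytorch_transformers.optimization import AdamW

no_decay = ['bias', 'LayerNorm.bias', 'LayerNorm.weight']   # assumed grouping, as in the examples
optimizer_grouped_parameters = [
    {'params': [p for n, p in model.named_parameters() if not any(nd in n for nd in no_decay)],
     'weight_decay': 0.01},
    {'params': [p for n, p in model.named_parameters() if any(nd in n for nd in no_decay)],
     'weight_decay': 0.0}
]
optimizer = AdamW(optimizer_grouped_parameters,
                  lr=3e-5,                               # learning rate
                  warmup=0.1,                            # proportion of steps spent warming up
                  t_total=num_train_optimization_steps,  # total number of optimization steps
                  max_grad_norm=1.0)                     # gradient clipping (-1 means no clipping)
```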
#### `OpenAIAdam`
`OpenAIAdam` is similar to `BertAdam`.
The differences with `BertAdam` is that `OpenAIAdam` compensate for bias as in the regular Adam optimizer.
`OpenAIAdam` accepts the same arguments as `BertAdam`.
#### Learning Rate Schedules
The `.optimization` module also provides additional schedules in the form of schedule objects that inherit from `_LRSchedule`.
......
@@ -60,10 +60,10 @@ This PyTorch implementation of Transformer-XL is an adaptation of the original `
This PyTorch implementation of OpenAI GPT-2 is an adaptation of `OpenAI's implementation <https://github.com/openai/gpt-2>`__ and is provided with `OpenAI's pre-trained model <https://github.com/openai/gpt-2>`__ and a command-line interface that was used to convert the TensorFlow checkpoint to PyTorch.
**Facebook Research's XLM** was released together with the paper `Cross-lingual Language Model Pretraining <https://arxiv.org/abs/1901.07291>`__ by Guillaume Lample and Alexis Conneau.
This PyTorch implementation of XLM is an adaptation of the original `PyTorch implementation <https://github.com/facebookresearch/XLM>`__. TODO Lysandre filled
This PyTorch implementation of XLM is an adaptation of the original `PyTorch implementation <https://github.com/facebookresearch/XLM>`__.
**Google's XLNet** was released together with the paper `XLNet: Generalized Autoregressive Pretraining for Language Understanding <https://arxiv.org/abs/1906.08237>`__ by Zhilin Yang\*, Zihang Dai\*, Yiming Yang, Jaime Carbonell, Ruslan Salakhutdinov and Quoc V. Le.
This PyTorch implementation of XLNet is an adaptation of the `Tensorflow implementation <https://github.com/zihangdai/xlnet>`__. TODO Lysandre filled
This PyTorch implementation of XLNet is an adaptation of the `Tensorflow implementation <https://github.com/zihangdai/xlnet>`__.
Content
@@ -91,7 +91,7 @@ Content
* - `Migration <./migration.html>`__
- Migrating from ``pytorch_pretrained_BERT`` (v0.6) to ``pytorch_transformers`` (v1.0)
* - `Bertology <./bertology.html>`__
- TODO Lysandre didn't know how to fill
- Exploring the internals of the pretrained models.
* - `TorchScript <./torchscript.html>`__
- Convert a model to TorchScript for use in other programming languages
@@ -115,8 +115,6 @@ Content
* - `XLNet <./model_doc/xlnet.html>`__
- XLNet Models, Tokenizers and optimizers
TODO Lysandre filled: might need an introduction for both parts. Is it even necessary, since there is a summary? Up to you Thom.
Overview
--------
@@ -219,17 +217,10 @@ TODO Lysandre filled: I filled in XLM and XLNet. I didn't do the Tokenizers beca
*
Optimizer for **BERT** (in the `optimization.py <./_modules/pytorch_transformers/optimization.html>`__ file):
Optimizer (in the `optimization.py <./_modules/pytorch_transformers/optimization.html>`__ file):
* ``BertAdam`` - Bert version of Adam algorithm with weight decay fix, warmup and linear decay of the learning rate.
*
Optimizer for **OpenAI GPT** (in the `optimization_openai.py <./_modules/pytorch_transformers/optimization_openai.html>`__ file):
* ``OpenAIAdam`` - OpenAI GPT version of Adam algorithm with weight decay fix, warmup and linear decay of the learning rate.
* ``AdamW`` - Version of Adam algorithm with weight decay fix, warmup and linear decay of the learning rate.
*
......
@@ -15,10 +15,10 @@ BERT
:members:
``BertAdam``
``AdamW``
~~~~~~~~~~~~~~~~
.. autoclass:: pytorch_transformers.BertAdam
.. autoclass:: pytorch_transformers.AdamW
:members:
``BertModel``
......
@@ -15,13 +15,6 @@ OpenAI GPT
:members:
``OpenAIAdam``
~~~~~~~~~~~~~~~~~~
.. autoclass:: pytorch_transformers.OpenAIAdam
:members:
``OpenAIGPTModel``
~~~~~~~~~~~~~~~~~~~~~~~~~
......
@@ -236,7 +236,7 @@ Learning Rate Schedules
The ``.optimization`` module also provides additional schedules in the form of schedule objects that inherit from ``_LRSchedule``.
All ``_LRSchedule`` subclasses accept ``warmup`` and ``t_total`` arguments at construction.
When an ``_LRSchedule`` object is passed into ``BertAdam`` or ``OpenAIAdam``\ ,
When an ``_LRSchedule`` object is passed into ``AdamW``\ ,
the ``warmup`` and ``t_total`` arguments on the optimizer are ignored and the ones in the ``_LRSchedule`` object are used.
An overview of the implemented schedules:
......
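As a minimal sketch of the behaviour described above, a schedule object can be built once and handed to the optimizer; this assumes the `schedule` keyword argument inherited from `BertAdam` and a `model` defined elsewhere:

```python
# Minimal sketch: passing an _LRSchedule object (here WarmupLinearSchedule) to AdamW.
from pytorch_transformers.optimization import AdamW, WarmupLinearSchedule

schedule = WarmupLinearSchedule(warmup=0.1, t_total=1000)   # 10% warmup, then linear decay
optimizer = AdamW(model.parameters(),
                  lr=2e-5,
                  schedule=schedule)  # warmup/t_total given to the optimizer itself are ignored here
```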
@@ -16,7 +16,7 @@ from tqdm import tqdm
from pytorch_transformers import WEIGHTS_NAME, CONFIG_NAME
from pytorch_transformers.modeling_bert import BertForPreTraining
from pytorch_transformers.tokenization_bert import BertTokenizer
from pytorch_transformers.optimization import BertAdam, WarmupLinearSchedule
from pytorch_transformers.optimization import AdamW, WarmupLinearSchedule
InputFeatures = namedtuple("InputFeatures", "input_ids input_mask segment_ids lm_label_ids is_next")
@@ -273,7 +273,7 @@ def main():
warmup_linear = WarmupLinearSchedule(warmup=args.warmup_proportion,
t_total=num_train_optimization_steps)
else:
optimizer = BertAdam(optimizer_grouped_parameters,
optimizer = AdamW(optimizer_grouped_parameters,
lr=args.learning_rate,
warmup=args.warmup_proportion,
t_total=num_train_optimization_steps)
......
#!/usr/bin/env python3
# Copyright 2018 CMU and The HuggingFace Inc. team.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
""" Bertology: this script shows how you can explore the internals of the models in the library to:
- compute the entropy of the head attentions
- compute the importance of each head
- prune (remove) the low importance head.
Some parts of this script are adapted from the code of Michel et al. (http://arxiv.org/abs/1905.10650)
which is available at https://github.com/pmichel31415/are-16-heads-really-better-than-1
"""
import os
import argparse
import logging
@@ -12,43 +32,49 @@ from torch.utils.data import DataLoader, SequentialSampler, TensorDataset, Subse
from torch.utils.data.distributed import DistributedSampler
from torch.nn import CrossEntropyLoss, MSELoss
from pytorch_transformers import BertForSequenceClassification, BertTokenizer
from pytorch_transformers import (WEIGHTS_NAME,
BertConfig, BertForSequenceClassification, BertTokenizer,
XLMConfig, XLMForSequenceClassification, XLMTokenizer,
XLNetConfig, XLNetForSequenceClassification, XLNetTokenizer)
from utils_glue import processors, output_modes, convert_examples_to_features, compute_metrics
from run_glue import set_seed, load_and_cache_examples, ALL_MODELS, MODEL_CLASSES
from utils_glue import (compute_metrics, convert_examples_to_features,
output_modes, processors)
logger = logging.getLogger(__name__)
def entropy(p):
""" Compute the entropy of a probability distribution """
plogp = p * torch.log(p)
plogp[p == 0] = 0
return -plogp.sum(dim=-1)
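As a quick sanity check of what `entropy` computes, a uniform distribution gives the maximal value `log(n)` and a fully peaked one gives zero; a small self-contained sketch:

```python
# Minimal sketch: entropy of a uniform vs. a one-hot attention distribution.
import torch

def entropy(p):                      # same definition as in the script above
    plogp = p * torch.log(p)
    plogp[p == 0] = 0
    return -plogp.sum(dim=-1)

uniform = torch.full((4,), 0.25)               # uniform attention over 4 positions
peaked = torch.tensor([1.0, 0.0, 0.0, 0.0])    # attention concentrated on one position
print(entropy(uniform))   # tensor(1.3863) == log(4), maximal entropy
print(entropy(peaked))    # tensor(-0.), i.e. zero entropy
```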
def print_1d_tensor(tensor, prefix=""):
if tensor.dtype != torch.long:
logger.info(prefix + "\t".join(f"{x:.5f}" for x in tensor.cpu().data))
else:
logger.info(prefix + "\t".join(f"{x:d}" for x in tensor.cpu().data))
def print_2d_tensor(tensor):
""" Print a 2D tensor """
logger.info("lv, h >\t" + "\t".join(f"{x + 1}" for x in range(len(tensor))))
for row in range(len(tensor)):
print_1d_tensor(tensor[row], prefix=f"layer {row + 1}:\t")
if tensor.dtype != torch.long:
logger.info(f"layer {row + 1}:\t" + "\t".join(f"{x:.5f}" for x in tensor[row].cpu().data))
else:
logger.info(f"layer {row + 1}:\t" + "\t".join(f"{x:d}" for x in tensor[row].cpu().data))
def compute_heads_importance(args, model, eval_dataloader, compute_entropy=True, compute_importance=True, head_mask=None):
""" Example on how to use model outputs to compute:
""" This method shows how to compute:
- head attention entropy (activated by setting output_attentions=True when we created the model
- head attention entropy
- head importance scores according to http://arxiv.org/abs/1905.10650
(activated by setting keep_multihead_output=True when we created the model)
""" """
# Prepare our tensors # Prepare our tensors
n_layers, n_heads = model.bert.config.num_hidden_layers, model.bert.config.num_attention_heads n_layers, n_heads = model.bert.config.num_hidden_layers, model.bert.config.num_attention_heads
head_importance = torch.zeros(n_layers, n_heads).to(args.device) head_importance = torch.zeros(n_layers, n_heads).to(args.device)
attn_entropy = torch.zeros(n_layers, n_heads).to(args.device) attn_entropy = torch.zeros(n_layers, n_heads).to(args.device)
if head_mask is None:
head_mask = torch.ones(n_layers, n_heads).to(args.device)
head_mask.requires_grad_(requires_grad=True)
preds = None
labels = None
tot_tokens = 0.0
@@ -58,29 +84,17 @@ def compute_heads_importance(args, model, eval_dataloader, compute_entropy=True,
input_ids, input_mask, segment_ids, label_ids = batch
# Do a forward pass (not with torch.no_grad() since we need gradients for importance score - see below)
all_attentions, logits = model(input_ids, token_type_ids=segment_ids, attention_mask=input_mask, head_mask=head_mask)
outputs = model(input_ids, token_type_ids=segment_ids, attention_mask=input_mask, labels=label_ids, head_mask=head_mask)
loss, logits, all_attentions = outputs[0], outputs[1], outputs[-1] # Loss and logits are the first, attention the last
loss.backward() # Backpropagate to populate the gradients in the head mask
if compute_entropy:
# Update head attention entropy
for layer, attn in enumerate(all_attentions):
masked_entropy = entropy(attn.detach()) * input_mask.float().unsqueeze(1)
attn_entropy[layer] += masked_entropy.sum(-1).sum(0).detach()
if compute_importance:
# Update head importance scores with regards to our loss
head_importance += head_mask.grad.abs().detach()
# First, backpropagate to populate the gradients
if args.output_mode == "classification":
loss_fct = CrossEntropyLoss()
loss = loss_fct(logits.view(-1, args.num_labels), label_ids.view(-1))
elif args.output_mode == "regression":
loss_fct = MSELoss()
loss = loss_fct(logits.view(-1), label_ids.view(-1))
loss.backward()
# Second, compute importance scores according to http://arxiv.org/abs/1905.10650
multihead_outputs = model.bert.get_multihead_outputs()
for layer, mh_layer_output in enumerate(multihead_outputs):
dot = torch.einsum("bhli,bhli->bhl", [mh_layer_output.grad, mh_layer_output])
head_importance[layer] += dot.abs().sum(-1).sum(0).detach()
# Also store our logits/labels if we want to compute metrics afterwards
if preds is None:
@@ -104,130 +118,6 @@ def compute_heads_importance(args, model, eval_dataloader, compute_entropy=True,
if not args.dont_normalize_global_importance:
head_importance = (head_importance - head_importance.min()) / (head_importance.max() - head_importance.min())
return attn_entropy, head_importance, preds, labels
def run_model():
parser = argparse.ArgumentParser()
parser.add_argument('--model_name_or_path', type=str, default='bert-base-cased-finetuned-mrpc', help='pretrained model name or path to local checkpoint')
parser.add_argument("--task_name", type=str, default='mrpc', help="The name of the task to train.")
parser.add_argument("--data_dir", type=str, required=True, help="The input data dir. Should contain the .tsv files (or other data files) for the task.")
parser.add_argument("--output_dir", type=str, required=True, help="The output directory where the model predictions and checkpoints will be written.")
parser.add_argument("--data_subset", type=int, default=-1, help="If > 0: limit the data to a subset of data_subset instances.")
parser.add_argument("--overwrite_output_dir", action='store_true', help="Whether to overwrite data in output directory")
parser.add_argument("--dont_normalize_importance_by_layer", action='store_true', help="Don't normalize importance score by layers")
parser.add_argument("--dont_normalize_global_importance", action='store_true', help="Don't normalize all importance scores between 0 and 1")
parser.add_argument("--try_masking", action='store_true', help="Whether to try to mask head until a threshold of accuracy.")
parser.add_argument("--masking_threshold", default=0.9, type=float, help="masking threshold in term of metrics"
"(stop masking when metric < threshold * original metric value).")
parser.add_argument("--masking_amount", default=0.1, type=float, help="Amount to heads to masking at each masking step.")
parser.add_argument("--metric_name", default="acc", type=str, help="Metric to use for head masking.")
parser.add_argument("--max_seq_length", default=128, type=int, help="The maximum total input sequence length after WordPiece tokenization. \n"
"Sequences longer than this will be truncated, and sequences shorter \n"
"than this will be padded.")
parser.add_argument("--batch_size", default=1, type=int, help="Batch size.")
parser.add_argument("--seed", type=int, default=42)
parser.add_argument("--local_rank", type=int, default=-1, help="local_rank for distributed training on gpus")
parser.add_argument("--no_cuda", action='store_true', help="Whether not to use CUDA when available")
parser.add_argument('--server_ip', type=str, default='', help="Can be used for distant debugging.")
parser.add_argument('--server_port', type=str, default='', help="Can be used for distant debugging.")
args = parser.parse_args()
if args.server_ip and args.server_port:
# Distant debugging - see https://code.visualstudio.com/docs/python/debugging#_attach-to-a-local-script
import ptvsd
print("Waiting for debugger attach")
ptvsd.enable_attach(address=(args.server_ip, args.server_port), redirect_output=True)
ptvsd.wait_for_attach()
# Setup devices and distributed training
if args.local_rank == -1 or args.no_cuda:
args.device = torch.device("cuda" if torch.cuda.is_available() and not args.no_cuda else "cpu")
n_gpu = torch.cuda.device_count()
else:
torch.cuda.set_device(args.local_rank)
args.device = torch.device("cuda", args.local_rank)
n_gpu = 1
torch.distributed.init_process_group(backend='nccl') # Initializes the distributed backend
# Setup logging
logging.basicConfig(level = logging.INFO if args.local_rank in [-1, 0] else logging.WARN)
logger.info("device: {} n_gpu: {}, distributed: {}".format(args.device, n_gpu, bool(args.local_rank != -1)))
# Set seeds
np.random.seed(args.seed)
torch.random.manual_seed(args.seed)
if n_gpu > 0:
torch.cuda.manual_seed(args.seed)
# Prepare GLUE task
task_name = args.task_name.lower()
processor = processors[task_name]()
label_list = processor.get_labels()
args.output_mode = output_modes[task_name]
args.num_labels = len(label_list)
# Prepare output directory
if os.path.exists(args.output_dir) and os.listdir(args.output_dir) and not args.overwrite_output_dir:
raise ValueError("Output directory ({}) already exists and is not empty.".format(args.output_dir))
if not os.path.exists(args.output_dir) and args.local_rank in [-1, 0]:
os.makedirs(args.output_dir)
# Load model & tokenizer
if args.local_rank not in [-1, 0]:
torch.distributed.barrier() # Make sure only one distributed process download model & vocab
tokenizer = BertTokenizer.from_pretrained(args.model_name_or_path)
# Load a model with all BERTology options on:
# output_attentions => will output attention weights
# keep_multihead_output => will store gradient of attention head outputs for head importance computation
# see: http://arxiv.org/abs/1905.10650
model = BertForSequenceClassification.from_pretrained(args.model_name_or_path,
num_labels=args.num_labels,
output_attentions=True,
keep_multihead_output=True)
if args.local_rank == 0:
torch.distributed.barrier() # Make sure only one distributed process download model & vocab
model.to(args.device)
if args.local_rank != -1:
model = torch.nn.parallel.DistributedDataParallel(model, device_ids=[args.local_rank], output_device=args.local_rank, find_unused_parameters=True)
model.eval()
# Prepare dataset for the GLUE task
eval_examples = processor.get_dev_examples(args.data_dir)
cached_eval_features_file = os.path.join(args.data_dir, 'dev_{0}_{1}_{2}'.format(
list(filter(None, args.model_name_or_path.split('/'))).pop(), str(args.max_seq_length), str(task_name)))
try:
eval_features = torch.load(cached_eval_features_file)
except:
eval_features = convert_examples_to_features(eval_examples, label_list, args.max_seq_length, tokenizer, args.output_mode)
if args.local_rank in [-1, 0]:
logger.info("Saving eval features to cache file %s", cached_eval_features_file)
torch.save(eval_features, cached_eval_features_file)
all_input_ids = torch.tensor([f.input_ids for f in eval_features], dtype=torch.long)
all_input_mask = torch.tensor([f.input_mask for f in eval_features], dtype=torch.long)
all_segment_ids = torch.tensor([f.segment_ids for f in eval_features], dtype=torch.long)
all_label_ids = torch.tensor([f.label_id for f in eval_features], dtype=torch.long if args.output_mode == "classification" else torch.float)
eval_data = TensorDataset(all_input_ids, all_input_mask, all_segment_ids, all_label_ids)
if args.data_subset > 0:
eval_data = Subset(eval_data, list(range(min(args.data_subset, len(eval_data)))))
eval_sampler = SequentialSampler(eval_data) if args.local_rank == -1 else DistributedSampler(eval_data)
eval_dataloader = DataLoader(eval_data, sampler=eval_sampler, batch_size=args.batch_size)
# Print/save training arguments
print(args)
torch.save(args, os.path.join(args.output_dir, 'run_args.bin'))
# Compute head entropy and importance score
attn_entropy, head_importance, _, _ = compute_heads_importance(args, model, eval_dataloader)
# Print/save matrices
np.save(os.path.join(args.output_dir, 'attn_entropy.npy'), attn_entropy.detach().cpu().numpy())
np.save(os.path.join(args.output_dir, 'head_importance.npy'), head_importance.detach().cpu().numpy())
@@ -242,11 +132,16 @@ def run_model():
head_ranks = head_ranks.view_as(head_importance)
print_2d_tensor(head_ranks)
# Do masking if we want to
return attn_entropy, head_importance, preds, labels
if args.try_masking and args.masking_threshold > 0.0 and args.masking_threshold < 1.0:
def mask_heads(args, model, eval_dataloader):
""" This method shows how to mask head (set some heads to zero), to test the effect on the network,
based on the head importance scores, as described in Michel et al. (http://arxiv.org/abs/1905.10650)
"""
_, head_importance, preds, labels = compute_heads_importance(args, model, eval_dataloader, compute_entropy=False)
preds = np.argmax(preds, axis=1) if args.output_mode == "classification" else np.squeeze(preds)
original_score = compute_metrics(task_name, preds, labels)[args.metric_name]
original_score = compute_metrics(args.task_name, preds, labels)[args.metric_name]
logger.info("Pruning: original score: %f, threshold: %f", original_score, original_score * args.masking_threshold)
new_head_mask = torch.ones_like(head_importance)
@@ -273,38 +168,179 @@ def run_model():
# Compute metric and head importance again
_, head_importance, preds, labels = compute_heads_importance(args, model, eval_dataloader, compute_entropy=False, head_mask=new_head_mask)
preds = np.argmax(preds, axis=1) if args.output_mode == "classification" else np.squeeze(preds)
current_score = compute_metrics(task_name, preds, labels)[args.metric_name]
current_score = compute_metrics(args.task_name, preds, labels)[args.metric_name]
logger.info("Masking: current score: %f, remaning heads %d (%.1f percents)", current_score, new_head_mask.sum(), new_head_mask.sum()/new_head_mask.numel() * 100)
logger.info("Final head mask")
print_2d_tensor(head_mask)
np.save(os.path.join(args.output_dir, 'head_mask.npy'), head_mask.detach().cpu().numpy())
return head_mask
def prune_heads(args, model, eval_dataloader, head_mask):
""" This method shows how to prune head (remove heads weights) based on
the head importance scores as described in Michel et al. (http://arxiv.org/abs/1905.10650)
"""
# Try pruning and test time speedup
# Pruning is like masking but we actually remove the masked weights
before_time = datetime.now()
_, _, preds, labels = compute_heads_importance(args, model, eval_dataloader,
compute_entropy=False, compute_importance=False, head_mask=head_mask)
preds = np.argmax(preds, axis=1) if args.output_mode == "classification" else np.squeeze(preds)
score_masking = compute_metrics(task_name, preds, labels)[args.metric_name]
score_masking = compute_metrics(args.task_name, preds, labels)[args.metric_name]
original_time = datetime.now() - before_time
original_num_params = sum(p.numel() for p in model.parameters())
heads_to_prune = dict((layer, (1 - head_mask[layer].long()).nonzero().tolist()) for layer in range(len(head_mask)))
assert sum(len(h) for h in heads_to_prune.values()) == (1 - head_mask.long()).sum().item()
model.bert.prune_heads(heads_to_prune)
model.prune_heads(heads_to_prune)
pruned_num_params = sum(p.numel() for p in model.parameters())
before_time = datetime.now()
_, _, preds, labels = compute_heads_importance(args, model, eval_dataloader,
compute_entropy=False, compute_importance=False, head_mask=None)
preds = np.argmax(preds, axis=1) if args.output_mode == "classification" else np.squeeze(preds)
score_pruning = compute_metrics(task_name, preds, labels)[args.metric_name]
score_pruning = compute_metrics(args.task_name, preds, labels)[args.metric_name]
new_time = datetime.now() - before_time
logger.info("Pruning: original num of params: %.2e, after pruning %.2e (%.1f percents)", original_num_params, pruned_num_params, pruned_num_params/original_num_params * 100)
logger.info("Pruning: score with masking: %f score with pruning: %f", score_masking, score_pruning)
logger.info("Pruning: speed ratio (new timing / original timing): %f percents", original_time/new_time * 100)
def main():
parser = argparse.ArgumentParser()
parser.add_argument("--data_dir", default=None, type=str, required=True,
help="The input data dir. Should contain the .tsv files (or other data files) for the task.")
parser.add_argument("--model_name", default=None, type=str, required=True,
help="Bert/XLNet/XLM pre-trained model selected in the list: " + ", ".join(ALL_MODELS))
parser.add_argument("--task_name", default=None, type=str, required=True,
help="The name of the task to train selected in the list: " + ", ".join(processors.keys()))
parser.add_argument("--output_dir", default=None, type=str, required=True,
help="The output directory where the model predictions and checkpoints will be written.")
## Other parameters
parser.add_argument("--config_name", default="", type=str,
help="Pretrained config name or path if not the same as model_name")
parser.add_argument("--tokenizer_name", default="", type=str,
help="Pretrained tokenizer name or path if not the same as model_name")
parser.add_argument("--cache_dir", default="", type=str,
help="Where do you want to store the pre-trained models downloaded from s3")
parser.add_argument("--data_subset", type=int, default=-1,
help="If > 0: limit the data to a subset of data_subset instances.")
parser.add_argument("--overwrite_output_dir", action='store_true',
help="Whether to overwrite data in output directory")
parser.add_argument("--dont_normalize_importance_by_layer", action='store_true',
help="Don't normalize importance score by layers")
parser.add_argument("--dont_normalize_global_importance", action='store_true',
help="Don't normalize all importance scores between 0 and 1")
parser.add_argument("--try_masking", action='store_true',
help="Whether to try to mask head until a threshold of accuracy.")
parser.add_argument("--masking_threshold", default=0.9, type=float,
help="masking threshold in term of metrics (stop masking when metric < threshold * original metric value).")
parser.add_argument("--masking_amount", default=0.1, type=float,
help="Amount to heads to masking at each masking step.")
parser.add_argument("--metric_name", default="acc", type=str,
help="Metric to use for head masking.")
parser.add_argument("--max_seq_length", default=128, type=int,
help="The maximum total input sequence length after WordPiece tokenization. \n"
"Sequences longer than this will be truncated, sequences shorter padded.")
parser.add_argument("--batch_size", default=1, type=int, help="Batch size.")
parser.add_argument("--seed", type=int, default=42)
parser.add_argument("--local_rank", type=int, default=-1, help="local_rank for distributed training on gpus")
parser.add_argument("--no_cuda", action='store_true', help="Whether not to use CUDA when available")
parser.add_argument('--server_ip', type=str, default='', help="Can be used for distant debugging.")
parser.add_argument('--server_port', type=str, default='', help="Can be used for distant debugging.")
args = parser.parse_args()
if args.server_ip and args.server_port:
# Distant debugging - see https://code.visualstudio.com/docs/python/debugging#_attach-to-a-local-script
import ptvsd
print("Waiting for debugger attach")
ptvsd.enable_attach(address=(args.server_ip, args.server_port), redirect_output=True)
ptvsd.wait_for_attach()
# Setup devices and distributed training
if args.local_rank == -1 or args.no_cuda:
args.device = torch.device("cuda" if torch.cuda.is_available() and not args.no_cuda else "cpu")
args.n_gpu = torch.cuda.device_count()
else:
torch.cuda.set_device(args.local_rank)
args.device = torch.device("cuda", args.local_rank)
args.n_gpu = 1
torch.distributed.init_process_group(backend='nccl') # Initializes the distributed backend
# Setup logging
logging.basicConfig(level = logging.INFO if args.local_rank in [-1, 0] else logging.WARN)
logger.info("device: {} n_gpu: {}, distributed: {}".format(args.device, args.n_gpu, bool(args.local_rank != -1)))
# Set seeds
set_seed(args)
# Prepare GLUE task
args.task_name = args.task_name.lower()
if args.task_name not in processors:
raise ValueError("Task not found: %s" % (args.task_name))
processor = processors[args.task_name]()
args.output_mode = output_modes[args.task_name]
label_list = processor.get_labels()
num_labels = len(label_list)
# Load pretrained model and tokenizer
if args.local_rank not in [-1, 0]:
torch.distributed.barrier() # Make sure only the first process in distributed training will download model & vocab
args.model_type = ""
for key in MODEL_CLASSES:
if key in args.model_name.lower():
args.model_type = key # take the first match in model types
break
config_class, model_class, tokenizer_class = MODEL_CLASSES[args.model_type]
config = config_class.from_pretrained(args.config_name if args.config_name else args.model_name,
num_labels=num_labels, finetuning_task=args.task_name,
output_attentions=True)
tokenizer = tokenizer_class.from_pretrained(args.tokenizer_name if args.tokenizer_name else args.model_name)
model = model_class.from_pretrained(args.model_name, from_tf=bool('.ckpt' in args.model_name), config=config)
if args.local_rank == 0:
torch.distributed.barrier() # Make sure only the first process in distributed training will download model & vocab
# Distributed and parallel training
model.to(args.device)
if args.local_rank != -1:
model = torch.nn.parallel.DistributedDataParallel(model, device_ids=[args.local_rank],
output_device=args.local_rank,
find_unused_parameters=True)
elif args.n_gpu > 1:
model = torch.nn.DataParallel(model)
# Print/save training arguments
torch.save(args, os.path.join(args.output_dir, 'run_args.bin'))
logger.info("Training/evaluation parameters %s", args)
# Prepare dataset for the GLUE task
eval_data = load_and_cache_examples(args, args.task_name, tokenizer, evaluate=True)
if args.data_subset > 0:
eval_data = Subset(eval_data, list(range(min(args.data_subset, len(eval_data)))))
eval_sampler = SequentialSampler(eval_data) if args.local_rank == -1 else DistributedSampler(eval_data)
eval_dataloader = DataLoader(eval_data, sampler=eval_sampler, batch_size=args.batch_size)
# Compute head entropy and importance score
compute_heads_importance(args, model, eval_dataloader)
# Try head masking (set heads to zero until the score goes under a threshold)
# and head pruning (remove masked heads and see the effect on the network)
if args.try_masking and args.masking_threshold > 0.0 and args.masking_threshold < 1.0:
head_mask = mask_heads(args, model, eval_dataloader)
prune_heads(args, model, eval_dataloader, head_mask)
if __name__ == '__main__':
run_model()
main()
@@ -14,7 +14,7 @@
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
""" Generation with GPT/GPT-2/Transformer-XL/XLNet models
""" Conditional text generation with the auto-regressive models of the library (GPT/GPT-2/Transformer-XL/XLNet)
"""
from __future__ import absolute_import, division, print_function, unicode_literals
......
@@ -13,7 +13,7 @@
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
""" Finetuning a classification model (Bert, XLM, XLNet,...) on GLUE."""
""" Finetuning the library models for sequence classification on GLUE (Bert, XLM, XLNet)."""
from __future__ import absolute_import, division, print_function
@@ -230,6 +230,9 @@ def evaluate(args, model, tokenizer, prefix=""):
logger.info("  %s = %s", key, str(result[key]))
writer.write("%s = %s\n" % (key, str(result[key])))
if args.local_rank in [-1, 0]:
tb_writer.close()
return results
@@ -242,7 +245,7 @@ def load_and_cache_examples(args, task, tokenizer, evaluate=False):
list(filter(None, args.model_name.split('/'))).pop(),
str(args.max_seq_length),
str(task)))
if os.path.exists(cached_features_file) and not args.overwrite_cache:
if os.path.exists(cached_features_file):
logger.info("Loading features from cached file %s", cached_features_file)
features = torch.load(cached_features_file)
else:
@@ -410,7 +413,7 @@ def main():
if args.local_rank == 0:
torch.distributed.barrier()  # Make sure only the first process in distributed training will download model & vocab
# Distributed and parrallel training
# Distributed and parallel training
model.to(args.device)
if args.local_rank != -1:
model = torch.nn.parallel.DistributedDataParallel(model, device_ids=[args.local_rank],
......
@@ -13,7 +13,7 @@
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
""" Finetuning a question-answering model (Bert, XLM, XLNet,...) on SQuAD."""
""" Finetuning the library models for question-answering on SQuAD (Bert, XLM, XLNet)."""
from __future__ import absolute_import, division, print_function
@@ -21,7 +21,7 @@ import argparse
import logging
import os
import random
from io import open
import glob
import numpy as np
import torch
@@ -43,6 +43,9 @@ from pytorch_transformers import AdamW, WarmupLinearSchedule
from utils_squad import read_squad_examples, convert_examples_to_features, RawResult, write_predictions
# The following import is the official SQuAD evaluation script (2.0).
# You can remove it from the dependencies if you are using this script outside of the library
# We've added it here for automated tests (see examples/test_examples.py file)
from utils_squad_evaluate import EVAL_OPTS, main as evaluate_on_squad
logger = logging.getLogger(__name__)
@@ -123,7 +126,7 @@ def train(args, train_dataset, model, tokenizer):
loss = ouputs[0]  # model outputs are always tuple in pytorch-transformers (see doc)
if args.n_gpu > 1:
loss = loss.mean()  # mean() to average on multi-gpu parallel training
loss = loss.mean()  # mean() to average on multi-gpu parallel (not distributed) training
if args.gradient_accumulation_steps > 1:
loss = loss / args.gradient_accumulation_steps
@@ -169,6 +172,9 @@ def train(args, train_dataset, model, tokenizer):
train_iterator.close()
break
if args.local_rank in [-1, 0]:
tb_writer.close()
return global_step, tr_loss / global_step
@@ -208,16 +214,16 @@ def evaluate(args, model, tokenizer, prefix=""):
start_logits=start_logits,
end_logits=end_logits))
# Compute predictions
output_prediction_file = os.path.join(args.output_dir, "predictions_{}.json".format(prefix))
output_nbest_file = os.path.join(args.output_dir, "nbest_predictions_{}.json".format(prefix))
output_null_log_odds_file = os.path.join(args.output_dir, "null_odds_{}.json".format(prefix))
all_predictions = write_predictions(examples, features, all_results,
args.n_best_size, args.max_answer_length,
args.do_lower_case, output_prediction_file,
output_nbest_file, output_null_log_odds_file,
args.verbose_logging, args.version_2_with_negative,
args.null_score_diff_threshold)
write_predictions(examples, features, all_results, args.n_best_size, args.max_answer_length,
args.do_lower_case, output_prediction_file, output_nbest_file,
output_null_log_odds_file, args.verbose_logging,
args.version_2_with_negative, args.null_score_diff_threshold)
# Evaluate with the official SQuAD script
evaluate_options = EVAL_OPTS(data_file=args.predict_file,
pred_file=output_prediction_file,
na_prob_file=output_null_log_odds_file)
@@ -432,7 +438,7 @@ def main():
logger.info(" global_step = %s, average loss = %s", global_step, tr_loss)
# Saving best-practices: if you use defaults names for the model, you can reload it using from_pretrained()
# Save the trained model and the tokenizer
if args.local_rank == -1 or torch.distributed.get_rank() == 0:
# Create output directory if needed
if not os.path.exists(args.output_dir) and args.local_rank in [-1, 0]:
@@ -454,22 +460,30 @@ def main():
model.to(args.device)
# Evaluation
# Evaluation - we can ask to evaluate all the checkpoints (sub-directories) in a directory
results = {}
if args.do_eval and args.local_rank in [-1, 0]:
checkpoints = [args.output_dir]
if args.eval_all_checkpoints:
checkpoints = list(os.path.dirname(c) for c in sorted(glob.glob(args.output_dir + '/**/' + WEIGHTS_NAME, recursive=True)))
logging.getLogger("pytorch_transformers.modeling_utils").setLevel(logging.WARN)  # Reduce logging
logging.getLogger("pytorch_transformers.modeling_utils").setLevel(logging.WARN)  # Reduce model loading logs
logger.info("Evaluate the following checkpoints: %s", checkpoints)
for checkpoint in checkpoints:
# Reload the model
global_step = checkpoint.split('-')[-1] if len(checkpoints) > 1 else ""
model = model_class.from_pretrained(checkpoint)
model.to(args.device)
# Evaluate
result = evaluate(args, model, tokenizer, prefix=global_step)
result = dict((k + ('_{}'.format(global_step) if global_step else ''), v) for k, v in result.items())
results.update(result)
logger.info("Results: {}".format(results))
return results
......
@@ -40,7 +40,7 @@ from torch.utils.data import (DataLoader, RandomSampler, SequentialSampler,
TensorDataset)
from pytorch_transformers import (OpenAIGPTDoubleHeadsModel, OpenAIGPTTokenizer,
OpenAIAdam, cached_path, WEIGHTS_NAME, CONFIG_NAME)
AdamW, cached_path, WEIGHTS_NAME, CONFIG_NAME)
ROCSTORIES_URL = "https://s3.amazonaws.com/datasets.huggingface.co/ROCStories.tar.gz"
@@ -191,7 +191,7 @@ def main():
{'params': [p for n, p in param_optimizer if any(nd in n for nd in no_decay)], 'weight_decay': 0.0}
]
num_train_optimization_steps = len(train_dataloader) * args.num_train_epochs
optimizer = OpenAIAdam(optimizer_grouped_parameters,
optimizer = AdamW(optimizer_grouped_parameters,
lr=args.learning_rate,
warmup=args.warmup_proportion,
max_grad_norm=args.max_grad_norm,
......
@@ -34,7 +34,7 @@ from tqdm import tqdm, trange
from pytorch_transformers.file_utils import PYTORCH_PRETRAINED_BERT_CACHE, WEIGHTS_NAME, CONFIG_NAME
from pytorch_transformers.modeling_bert import BertForMultipleChoice, BertConfig
from pytorch_transformers.optimization import BertAdam, WarmupLinearSchedule
from pytorch_transformers.optimization import AdamW, WarmupLinearSchedule
from pytorch_transformers.tokenization_bert import BertTokenizer
logging.basicConfig(format = '%(asctime)s - %(levelname)s - %(name)s - %(message)s',
......
@@ -91,7 +91,6 @@ class ExamplesTests(unittest.TestCase):
self.assertGreaterEqual(result['f1'], 30)
self.assertGreaterEqual(result['exact'], 30)
def test_generation(self):
stream_handler = logging.StreamHandler(sys.stdout)
logger.addHandler(stream_handler)
......
@@ -13,7 +13,7 @@
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
"""PyTorch BERT model."""
from __future__ import absolute_import, division, print_function, unicode_literals
@@ -28,7 +28,8 @@ import torch
from torch import nn
from torch.nn import CrossEntropyLoss, MSELoss
from .modeling_utils import WEIGHTS_NAME, CONFIG_NAME, PretrainedConfig, PreTrainedModel, prune_linear_layer
from .modeling_utils import (WEIGHTS_NAME, CONFIG_NAME, PretrainedConfig, PreTrainedModel,
prune_linear_layer, add_start_docstrings)
logger = logging.getLogger(__name__)
@@ -66,7 +67,7 @@ BERT_PRETRAINED_CONFIG_ARCHIVE_MAP = {
def load_tf_weights_in_bert(model, config, tf_checkpoint_path):
""" Load tf checkpoints in a pytorch model.
"""
try:
import re
@@ -583,25 +584,84 @@ class BertPreTrainedModel(PreTrainedModel):
module.bias.data.zero_()
class BertModel(BertPreTrainedModel):
BERT_START_DOCSTRING = r""" The BERT model was proposed in
r"""BERT model ("Bidirectional Embedding Representations from a Transformer").
`BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding`_
by Jacob Devlin, Ming-Wei Chang, Kenton Lee and Kristina Toutanova. It's a bidirectional transformer
:class:`~pytorch_transformers.BertModel` is the basic BERT Transformer model with a layer of summed token, \
pre-trained using a combination of masked language modeling objective and next sentence prediction
position and sequence embeddings followed by a series of identical self-attention blocks (12 for BERT-base, 24 \
on a large corpus comprising the Toronto Book Corpus and Wikipedia.
for BERT-large). The model is instantiated with the following parameters.
Arguments:
config: a BertConfig class instance with the configuration to build a new model
output_attentions: If True, also output attentions weights computed by the model at each layer. Default: False
output_hidden_states: If True, also output hidden states computed by the model at each layer. Default: Fals
This model is a PyTorch `torch.nn.Module`_ sub-class. Use it as a regular PyTorch Module and
refer to the PyTorch documentation for all matter related to general usage and behavior.
Example::
.. _`BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding`:
https://arxiv.org/abs/1810.04805
config = modeling.BertConfig(vocab_size_or_config_json_file=32000, hidden_size=768,
.. _`torch.nn.Module`:
num_hidden_layers=12, num_attention_heads=12, intermediate_size=3072)
https://pytorch.org/docs/stable/nn.html#module
model = modeling.BertModel(config=config)
Parameters:
config (:class:`~pytorch_transformers.BertConfig`): Model configuration class with all the parameters of the model.
"""
BERT_INPUTS_DOCSTRING = r"""
Inputs:
**input_ids**: ``torch.LongTensor`` of shape ``(batch_size, sequence_length)``:
Indices of input sequence tokens in the vocabulary.
To match pre-training, BERT input sequence should be formatted with [CLS] and [SEP] tokens as follows:
(a) For sequence pairs:
``tokens: [CLS] is this jack ##son ##ville ? [SEP] no it is not . [SEP]``
``token_type_ids: 0 0 0 0 0 0 0 0 1 1 1 1 1 1``
(b) For single sequences:
``tokens: [CLS] the dog is hairy . [SEP]``
``token_type_ids: 0 0 0 0 0 0 0``
Indices can be obtained using :class:`pytorch_transformers.BertTokenizer`.
See :func:`pytorch_transformers.PreTrainedTokenizer.encode` and
:func:`pytorch_transformers.PreTrainedTokenizer.convert_tokens_to_ids` for details.
**token_type_ids**: (`optional`) ``torch.LongTensor`` of shape ``(batch_size, sequence_length)``:
Segment token indices to indicate first and second portions of the inputs.
Indices are selected in ``[0, 1]``: ``0`` corresponds to a `sentence A` token, ``1``
corresponds to a `sentence B` token
(see `BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding`_ for more details).
**attention_mask**: (`optional`) ``torch.Tensor`` of shape ``(batch_size, sequence_length)``:
Mask to avoid performing attention on padding token indices.
Mask indices selected in ``[0, 1]``:
``1`` for tokens that are NOT MASKED, ``0`` for MASKED tokens.
**head_mask**: (`optional`) ``torch.Tensor`` of shape ``(num_heads,)`` or ``(num_layers, num_heads)``:
Mask to nullify selected heads of the self-attention modules.
Mask indices selected in ``[0, 1]``:
``1`` indicates the head is **not masked**, ``0`` indicates the head is **masked**.
"""
@add_start_docstrings("The bare Bert Model transformer outputing raw hidden-states without any specific head on top.",
BERT_START_DOCSTRING, BERT_INPUTS_DOCSTRING)
class BertModel(BertPreTrainedModel):
r"""
Outputs: `Tuple` comprising various elements depending on the configuration (config) and inputs:
**last_hidden_state**: ``torch.FloatTensor`` of shape ``(batch_size, sequence_length, hidden_size)``
Sequence of hidden-states at the last layer of the model.
**attentions**: (`optional`, returned when ``config.output_attentions=True``)
list of ``torch.FloatTensor`` (one for each layer) of shape ``(batch_size, num_heads, sequence_length, sequence_length)``:
Attentions weights after the attention softmax, used to compute the weighted average in the self-attention heads.
**hidden_states**: (`optional`, returned when ``config.output_hidden_states=True``)
list of ``torch.FloatTensor`` (one for the output of each layer + the output of the embeddings)
of shape ``(batch_size, sequence_length, hidden_size)``:
Hidden-states of the model at the output of each layer plus the initial embedding outputs.
Examples::
>>> config = BertConfig.from_pretrained('bert-base-uncased')
>>> tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
>>> model = BertModel(config)
>>> input_ids = torch.tensor(tokenizer.encode("Hello, my dog is cute")).unsqueeze(0) # Batch size 1
>>> outputs = model(input_ids)
>>> last_hidden_states = outputs[0] # The last hidden-state is the first element of the output tuple
""" """
def __init__(self, config): def __init__(self, config):
...@@ -628,58 +688,6 @@ class BertModel(BertPreTrainedModel): ...@@ -628,58 +688,6 @@ class BertModel(BertPreTrainedModel):
self.encoder.layer[layer].attention.prune_heads(heads) self.encoder.layer[layer].attention.prune_heads(heads)
def forward(self, input_ids, token_type_ids=None, attention_mask=None, head_mask=None): def forward(self, input_ids, token_type_ids=None, attention_mask=None, head_mask=None):
"""
Performs the model forward pass. **Once the model has been instantiated, calling it directly runs the forward pass.**
Arguments:
input_ids: a ``torch.LongTensor`` of shape [batch_size, sequence_length] with the word token indices in the \
vocabulary(see the tokens pre-processing logic in the scripts `run_bert_extract_features.py`, \
`run_bert_classifier.py` and `run_bert_squad.py`)
token_type_ids: an optional ``torch.LongTensor`` of shape [batch_size, sequence_length] with the token \
types indices selected in [0, 1]. Type 0 corresponds to a `sentence A` and type 1 corresponds to \
a `sentence B` token (see BERT paper for more details).
attention_mask: an optional ``torch.LongTensor`` of shape [batch_size, sequence_length] with indices \
selected in [0, 1]. It's a mask to be used if the input sequence length is smaller than the max \
input sequence length in the current batch. It's the mask that we typically use for attention when \
a batch has varying length sentences.
output_all_encoded_layers: boolean which controls the content of the `encoded_layers` output as described \
below. Default: `True`.
head_mask: an optional ``torch.Tensor`` of shape [num_heads] or [num_layers, num_heads] with indices between 0 \
and 1. It's a mask to be used to nullify some heads of the transformer. 1.0 => head is fully masked, 0.0 \
=> head is not masked.
Returns:
A tuple composed of (encoded_layers, pooled_output). Encoded layers are controlled by the \
``output_all_encoded_layers`` argument.
If ``output_all_encoded_layers`` is set to True, outputs a list of the full sequences of \
encoded-hidden-states at the end of each attention \
block (i.e. 12 full sequences for BERT-base, 24 for BERT-large), each encoded-hidden-state is a\
``torch.FloatTensor`` of size [batch_size, sequence_length, hidden_size].
If set to False, outputs only the full sequence of hidden-states corresponding \
to the last attention block of shape [batch_size, sequence_length, hidden_size].
``pooled_output`` is a ``torch.FloatTensor`` of size [batch_size, hidden_size] which is the output of a \
classifier pretrained on top of the hidden state associated with the first token of the \
input (`CLS`) to train on the Next-Sentence task (see BERT's paper).
Example::
# Already been converted into WordPiece token ids
input_ids = torch.LongTensor([[31, 51, 99], [15, 5, 0]])
input_mask = torch.LongTensor([[1, 1, 1], [1, 1, 0]])
token_type_ids = torch.LongTensor([[0, 0, 1], [0, 1, 0]])
all_encoder_layers, pooled_output = model(input_ids, token_type_ids, input_mask)
# or
all_encoder_layers, pooled_output = model.forward(input_ids, token_type_ids, input_mask)
"""
if attention_mask is None:
attention_mask = torch.ones_like(input_ids)
if token_type_ids is None:
@@ -726,25 +734,47 @@ class BertModel(BertPreTrainedModel):
return outputs  # sequence_output, pooled_output, (hidden_states), (attentions)
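A short sketch of how the optional entries show up in that output tuple once the two config flags documented above are switched on (the `bert-base-uncased` checkpoint is only an example, and the model here is randomly initialised rather than loaded with trained weights):

```python
import torch
from pytorch_transformers import BertConfig, BertTokenizer, BertModel

# Request the optional outputs via the config flags described above.
config = BertConfig.from_pretrained('bert-base-uncased')
config.output_hidden_states = True
config.output_attentions = True

tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
model = BertModel(config)  # use BertModel.from_pretrained(...) instead for trained weights
model.eval()

input_ids = torch.tensor(tokenizer.encode("Hello, my dog is cute")).unsqueeze(0)
sequence_output, pooled_output, hidden_states, attentions = model(input_ids)

# One hidden-state tensor per layer plus the embeddings, one attention map per layer.
assert len(hidden_states) == config.num_hidden_layers + 1
assert len(attentions) == config.num_hidden_layers
```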
@add_start_docstrings("""Bert Model transformer BERT model with two heads on top as done during the pre-training:
a `masked language modeling` head and a `next sentence prediction (classification)` head. """,
BERT_START_DOCSTRING, BERT_INPUTS_DOCSTRING)
class BertForPreTraining(BertPreTrainedModel):
r"""
**masked_lm_labels**: (`optional`) ``torch.LongTensor`` of shape ``(batch_size, sequence_length)``:
Labels for computing the masked language modeling loss.
Indices should be in ``[-1, 0, ..., config.vocab_size]`` (see ``input_ids`` docstring)
Tokens with indices set to ``-1`` are ignored (masked), the loss is only computed for the tokens with labels
in ``[0, ..., config.vocab_size]``
**next_sentence_label**: (`optional`) ``torch.LongTensor`` of shape ``(batch_size,)``:
Labels for computing the next sequence prediction (classification) loss. Input should be a sequence pair (see ``input_ids`` docstring)
Indices should be in ``[0, 1]``.
``0`` indicates sequence B is a continuation of sequence A,
``1`` indicates sequence B is a random sequence.
Outputs: `Tuple` comprising various elements depending on the configuration (config) and inputs:
**loss**: (`optional`, returned when both ``masked_lm_labels`` and ``next_sentence_label`` are provided) ``torch.FloatTensor`` of shape ``(1,)``:
Total loss as the sum of the masked language modeling loss and the next sequence prediction (classification) loss.
**prediction_scores**: ``torch.FloatTensor`` of shape ``(batch_size, sequence_length, config.vocab_size)``
Prediction scores of the language modeling head (scores for each vocabulary token before SoftMax).
**seq_relationship_scores**: ``torch.FloatTensor`` of shape ``(batch_size, 2)``
Prediction scores of the next sequence prediction (classification) head (scores of True/False continuation before SoftMax).
**attentions**: (`optional`, returned when ``config.output_attentions=True``)
list of ``torch.FloatTensor`` (one for each layer) of shape ``(batch_size, num_heads, sequence_length, sequence_length)``:
Attentions weights after the attention softmax, used to compute the weighted average in the self-attention heads.
**hidden_states**: (`optional`, returned when ``config.output_hidden_states=True``)
list of ``torch.FloatTensor`` (one for the output of each layer + the output of the embeddings)
of shape ``(batch_size, sequence_length, hidden_size)``:
Hidden-states of the model at the output of each layer plus the initial embedding outputs.
Examples::
>>> config = BertConfig.from_pretrained('bert-base-uncased')
>>> tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
>>>
>>> model = BertForPreTraining(config)
>>> input_ids = torch.tensor(tokenizer.encode("Hello, my dog is cute")).unsqueeze(0)  # Batch size 1
>>> outputs = model(input_ids)
>>> prediction_scores, seq_relationship_scores = outputs[:2]
"""
def __init__(self, config):
super(BertForPreTraining, self).__init__(config)
@@ -764,58 +794,6 @@ class BertForPreTraining(BertPreTrainedModel):
def forward(self, input_ids, token_type_ids=None, attention_mask=None, masked_lm_labels=None,
next_sentence_label=None, head_mask=None):
"""
Performs the model forward pass. **Once the model has been instantiated, calling it directly runs the forward pass.**
Args:
`input_ids`: a ``torch.LongTensor`` of shape [batch_size, sequence_length]
with the word token indices in the vocabulary(see the tokens preprocessing logic in the scripts
`run_bert_extract_features.py`, `run_bert_classifier.py` and `run_bert_squad.py`)
`token_type_ids`: an optional ``torch.LongTensor`` of shape [batch_size, sequence_length] with the token
types indices selected in [0, 1]. Type 0 corresponds to a `sentence A` and type 1 corresponds to
a `sentence B` token (see BERT paper for more details).
`attention_mask`: an optional ``torch.LongTensor`` of shape [batch_size, sequence_length] with indices
selected in [0, 1]. It's a mask to be used if the input sequence length is smaller than the max
input sequence length in the current batch. It's the mask that we typically use for attention when
a batch has varying length sentences.
`masked_lm_labels`: optional masked language modeling labels: ``torch.LongTensor`` of shape [batch_size, sequence_length]
with indices selected in [-1, 0, ..., vocab_size]. All labels set to -1 are ignored (masked), the loss
is only computed for the labels set in [0, ..., vocab_size]
`next_sentence_label`: optional next sentence classification loss: ``torch.LongTensor`` of shape [batch_size]
with indices selected in [0, 1].
0 => next sentence is the continuation, 1 => next sentence is a random sentence.
`head_mask`: an optional ``torch.Tensor`` of shape [num_heads] or [num_layers, num_heads] with indices between 0 and 1.
It's a mask to be used to nullify some heads of the transformer. 1.0 => head is fully masked, 0.0 => head is not masked.
Returns:
Either a ``torch.Tensor`` or ``tuple(torch.Tensor, torch.Tensor)``.
if ``masked_lm_labels`` and ``next_sentence_label`` are not ``None``, outputs the total_loss which is the \
sum of the masked language modeling loss and the next \
sentence classification loss.
if ``masked_lm_labels`` or ``next_sentence_label`` is ``None``, outputs a tuple made of:
- the masked language modeling logits of shape [batch_size, sequence_length, vocab_size]
- the next sentence classification logits of shape [batch_size, 2].
Example ::
# Already been converted into WordPiece token ids
input_ids = torch.LongTensor([[31, 51, 99], [15, 5, 0]])
input_mask = torch.LongTensor([[1, 1, 1], [1, 1, 0]])
token_type_ids = torch.LongTensor([[0, 0, 1], [0, 1, 0]])
config = BertConfig(vocab_size_or_config_json_file=32000, hidden_size=768,
num_hidden_layers=12, num_attention_heads=12, intermediate_size=3072)
model = BertForPreTraining(config)
masked_lm_logits_scores, seq_relationship_logits = model(input_ids, token_type_ids, input_mask)
# or
masked_lm_logits_scores, seq_relationship_logits = model.forward(input_ids, token_type_ids, input_mask)
"""
outputs = self.bert(input_ids, token_type_ids, attention_mask, head_mask=head_mask)
sequence_output, pooled_output = outputs[:2]
@@ -833,21 +811,39 @@ class BertForPreTraining(BertPreTrainedModel):
return outputs  # (loss), prediction_scores, seq_relationship_score, (hidden_states), (attentions)
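To illustrate how the two label tensors interact with that return tuple, here is a deliberately artificial sketch: a single sentence stands in for a real sentence pair, one position is masked by hand, and every unmasked position is set to ``-1`` so it is ignored by the MLM loss, as described above.

```python
import torch
from pytorch_transformers import BertConfig, BertTokenizer, BertForPreTraining

config = BertConfig.from_pretrained('bert-base-uncased')
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
model = BertForPreTraining(config)

input_ids = torch.tensor(tokenizer.encode("Hello, my dog is cute")).unsqueeze(0)

# Only positions we actually masked contribute to the MLM loss; everything else is -1 (ignored).
masked_lm_labels = torch.full_like(input_ids, -1)
masked_lm_labels[0, 2] = input_ids[0, 2]   # pretend the token at position 2 was masked out
input_ids[0, 2] = tokenizer.convert_tokens_to_ids(['[MASK]'])[0]

next_sentence_label = torch.tensor([0])    # 0: "sentence B" really follows "sentence A"

loss, prediction_scores, seq_relationship_scores = model(
    input_ids,
    masked_lm_labels=masked_lm_labels,
    next_sentence_label=next_sentence_label)[:3]
```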
@add_start_docstrings("""Bert Model transformer BERT model with a `language modeling` head on top. """,
BERT_START_DOCSTRING, BERT_INPUTS_DOCSTRING)
class BertForMaskedLM(BertPreTrainedModel):
r"""
**masked_lm_labels**: (`optional`) ``torch.LongTensor`` of shape ``(batch_size, sequence_length)``:
Labels for computing the masked language modeling loss.
Indices should be in ``[-1, 0, ..., config.vocab_size]`` (see ``input_ids`` docstring)
Tokens with indices set to ``-1`` are ignored (masked), the loss is only computed for the tokens with labels
in ``[0, ..., config.vocab_size]``
Outputs: `Tuple` comprising various elements depending on the configuration (config) and inputs:
**loss**: (`optional`, returned when ``masked_lm_labels`` is provided) ``torch.FloatTensor`` of shape ``(1,)``:
Masked language modeling loss.
**prediction_scores**: ``torch.FloatTensor`` of shape ``(batch_size, sequence_length, config.vocab_size)``
Prediction scores of the language modeling head (scores for each vocabulary token before SoftMax).
**attentions**: (`optional`, returned when ``config.output_attentions=True``)
list of ``torch.FloatTensor`` (one for each layer) of shape ``(batch_size, num_heads, sequence_length, sequence_length)``:
Attentions weights after the attention softmax, used to compute the weighted average in the self-attention heads.
**hidden_states**: (`optional`, returned when ``config.output_hidden_states=True``)
list of ``torch.FloatTensor`` (one for the output of each layer + the output of the embeddings)
of shape ``(batch_size, sequence_length, hidden_size)``:
Hidden-states of the model at the output of each layer plus the initial embedding outputs.
Examples::
>>> config = BertConfig.from_pretrained('bert-base-uncased')
>>> tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
>>>
>>> model = BertForMaskedLM(config)
>>> input_ids = torch.tensor(tokenizer.encode("Hello, my dog is cute")).unsqueeze(0)  # Batch size 1
>>> outputs = model(input_ids, masked_lm_labels=input_ids)
>>> loss, prediction_scores = outputs[:2]
"""
def __init__(self, config):
super(BertForMaskedLM, self).__init__(config)
@@ -866,45 +862,6 @@ class BertForMaskedLM(BertPreTrainedModel):
self.bert.embeddings.word_embeddings)
def forward(self, input_ids, token_type_ids=None, attention_mask=None, masked_lm_labels=None, head_mask=None):
"""
Performs the model forward pass. **Once the model has been instantiated, calling it directly runs the forward pass.**
Args:
`input_ids`: a ``torch.LongTensor`` of shape [batch_size, sequence_length]
with the word token indices in the vocabulary(see the tokens preprocessing logic in the scripts
`run_bert_extract_features.py`, `run_bert_classifier.py` and `run_bert_squad.py`)
`token_type_ids`: an optional ``torch.LongTensor`` of shape [batch_size, sequence_length] with the token
types indices selected in [0, 1]. Type 0 corresponds to a `sentence A` and type 1 corresponds to
a `sentence B` token (see BERT paper for more details).
`attention_mask`: an optional ``torch.LongTensor`` of shape [batch_size, sequence_length] with indices
selected in [0, 1]. It's a mask to be used if the input sequence length is smaller than the max
input sequence length in the current batch. It's the mask that we typically use for attention when
a batch has varying length sentences.
`masked_lm_labels`: masked language modeling labels: ``torch.LongTensor`` of shape [batch_size, sequence_length]
with indices selected in [-1, 0, ..., vocab_size]. All labels set to -1 are ignored (masked), the loss
is only computed for the labels set in [0, ..., vocab_size]
`head_mask`: an optional ``torch.Tensor`` of shape [num_heads] or [num_layers, num_heads] with indices between 0 and 1.
It's a mask to be used to nullify some heads of the transformer. 1.0 => head is fully masked, 0.0 => head is not masked.
Returns:
Masked language modeling loss if ``masked_lm_labels`` is specified, masked language modeling
logits of shape [batch_size, sequence_length, vocab_size] otherwise.
Example::
# Already been converted into WordPiece token ids
input_ids = torch.LongTensor([[31, 51, 99], [15, 5, 0]])
input_mask = torch.LongTensor([[1, 1, 1], [1, 1, 0]])
token_type_ids = torch.LongTensor([[0, 0, 1], [0, 1, 0]])
masked_lm_logits_scores = model(input_ids, token_type_ids, input_mask)
# or
masked_lm_logits_scores = model.forward(input_ids, token_type_ids, input_mask)
"""
outputs = self.bert(input_ids, token_type_ids, attention_mask, head_mask=head_mask)
sequence_output = outputs[0]
@@ -919,21 +876,39 @@ class BertForMaskedLM(BertPreTrainedModel):
return outputs  # (masked_lm_loss), prediction_scores, (hidden_states), (attentions)
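A small sketch of using ``prediction_scores`` to fill in a ``[MASK]`` position, assuming pretrained weights loaded with ``from_pretrained``; the greedy ``argmax`` decoding is just for illustration:

```python
import torch
from pytorch_transformers import BertTokenizer, BertForMaskedLM

tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
model = BertForMaskedLM.from_pretrained('bert-base-uncased')
model.eval()

tokens = ['[CLS]', 'my', 'dog', 'is', '[MASK]', '.', '[SEP]']
input_ids = torch.tensor([tokenizer.convert_tokens_to_ids(tokens)])
masked_index = tokens.index('[MASK]')

with torch.no_grad():
    prediction_scores = model(input_ids)[0]   # (batch_size, sequence_length, vocab_size)

predicted_id = prediction_scores[0, masked_index].argmax().item()
print(tokenizer.convert_ids_to_tokens([predicted_id]))
```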
@add_start_docstrings("""Bert Model transformer BERT model with a `next sentence prediction (classification)` head on top. """,
BERT_START_DOCSTRING, BERT_INPUTS_DOCSTRING)
class BertForNextSentencePrediction(BertPreTrainedModel):
r"""
**next_sentence_label**: (`optional`) ``torch.LongTensor`` of shape ``(batch_size,)``:
Labels for computing the next sequence prediction (classification) loss. Input should be a sequence pair (see ``input_ids`` docstring)
Indices should be in ``[0, 1]``.
``0`` indicates sequence B is a continuation of sequence A,
``1`` indicates sequence B is a random sequence.
Outputs: `Tuple` comprising various elements depending on the configuration (config) and inputs:
**loss**: (`optional`, returned when ``next_sentence_label`` is provided) ``torch.FloatTensor`` of shape ``(1,)``:
Next sequence prediction (classification) loss.
**seq_relationship_scores**: ``torch.FloatTensor`` of shape ``(batch_size, 2)``
Prediction scores of the next sequence prediction (classification) head (scores of True/False continuation before SoftMax).
**attentions**: (`optional`, returned when ``config.output_attentions=True``)
list of ``torch.FloatTensor`` (one for each layer) of shape ``(batch_size, num_heads, sequence_length, sequence_length)``:
Attentions weights after the attention softmax, used to compute the weighted average in the self-attention heads.
**hidden_states**: (`optional`, returned when ``config.output_hidden_states=True``)
list of ``torch.FloatTensor`` (one for the output of each layer + the output of the embeddings)
of shape ``(batch_size, sequence_length, hidden_size)``:
Hidden-states of the model at the output of each layer plus the initial embedding outputs.
Examples::
>>> config = BertConfig.from_pretrained('bert-base-uncased')
>>> tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
>>>
>>> model = BertForNextSentencePrediction(config)
>>> input_ids = torch.tensor(tokenizer.encode("Hello, my dog is cute")).unsqueeze(0)  # Batch size 1
>>> outputs = model(input_ids)
>>> seq_relationship_scores = outputs[0]
"""
def __init__(self, config):
super(BertForNextSentencePrediction, self).__init__(config)
@@ -944,44 +919,6 @@ class BertForNextSentencePrediction(BertPreTrainedModel):
self.apply(self.init_weights)
def forward(self, input_ids, token_type_ids=None, attention_mask=None, next_sentence_label=None, head_mask=None):
"""
Performs the model forward pass. **Once the model has been instantiated, calling it directly runs the forward pass.**
Args:
`input_ids`: a ``torch.LongTensor`` of shape [batch_size, sequence_length]
with the word token indices in the vocabulary(see the tokens pre-processing logic in the scripts
`run_bert_extract_features.py`, `run_bert_classifier.py` and `run_bert_squad.py`)
`token_type_ids`: an optional ``torch.LongTensor`` of shape [batch_size, sequence_length] with the token
types indices selected in [0, 1]. Type 0 corresponds to a `sentence A` and type 1 corresponds to
a `sentence B` token (see BERT paper for more details).
`attention_mask`: an optional ``torch.LongTensor`` of shape [batch_size, sequence_length] with indices
selected in [0, 1]. It's a mask to be used if the input sequence length is smaller than the max
input sequence length in the current batch. It's the mask that we typically use for attention when
a batch has varying length sentences.
`next_sentence_label`: next sentence classification loss: ``torch.LongTensor`` of shape [batch_size]
with indices selected in [0, 1].
0 => next sentence is the continuation, 1 => next sentence is a random sentence.
`head_mask`: an optional ``torch.Tensor`` of shape [num_heads] or [num_layers, num_heads] with indices between
0 and 1.It's a mask to be used to nullify some heads of the transformer. 1.0 => head is fully masked,
0.0 => head is not masked.
Returns:
If ``next_sentence_label`` is specified, outputs the total_loss which is the sum of the masked language
modeling loss and the next sentence classification loss. If ``next_sentence_label`` is ``None``, outputs
the next sentence classification logits of shape [batch_size, 2].
Example::
# Already been converted into WordPiece token ids
input_ids = torch.LongTensor([[31, 51, 99], [15, 5, 0]])
input_mask = torch.LongTensor([[1, 1, 1], [1, 1, 0]])
token_type_ids = torch.LongTensor([[0, 0, 1], [0, 1, 0]])
seq_relationship_logits = model(input_ids, token_type_ids, input_mask)
# or
seq_relationship_logits = model.forward(input_ids, token_type_ids, input_mask)
"""
outputs = self.bert(input_ids, token_type_ids, attention_mask, head_mask=head_mask)
pooled_output = outputs[1]
@@ -996,25 +933,41 @@ class BertForNextSentencePrediction(BertPreTrainedModel):
return outputs  # (next_sentence_loss), seq_relationship_score, (hidden_states), (attentions)
@add_start_docstrings("""Bert Model transformer BERT model with a sequence classification/regression head on top (a linear layer on top of
the pooled output) e.g. for GLUE tasks. """,
BERT_START_DOCSTRING, BERT_INPUTS_DOCSTRING)
class BertForSequenceClassification(BertPreTrainedModel):
r"""
**labels**: (`optional`) ``torch.LongTensor`` of shape ``(batch_size,)``:
Labels for computing the sequence classification/regression loss.
Indices should be in ``[0, ..., config.num_labels]``.
If ``config.num_labels == 1`` a regression loss is computed (Mean-Square loss),
If ``config.num_labels > 1`` a classification loss is computed (Cross-Entropy).
Outputs: `Tuple` comprising various elements depending on the configuration (config) and inputs:
**loss**: (`optional`, returned when ``labels`` is provided) ``torch.FloatTensor`` of shape ``(1,)``:
Classification (or regression if config.num_labels==1) loss.
**logits**: ``torch.FloatTensor`` of shape ``(batch_size, config.num_labels)``
Classification (or regression if config.num_labels==1) scores (before SoftMax).
**attentions**: (`optional`, returned when ``config.output_attentions=True``)
list of ``torch.FloatTensor`` (one for each layer) of shape ``(batch_size, num_heads, sequence_length, sequence_length)``:
Attentions weights after the attention softmax, used to compute the weighted average in the self-attention heads.
**hidden_states**: (`optional`, returned when ``config.output_hidden_states=True``)
list of ``torch.FloatTensor`` (one for the output of each layer + the output of the embeddings)
of shape ``(batch_size, sequence_length, hidden_size)``:
Hidden-states of the model at the output of each layer plus the initial embedding outputs.
Examples::
>>> config = BertConfig.from_pretrained('bert-base-uncased')
>>> tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
>>>
>>> model = BertForSequenceClassification(config)
>>> input_ids = torch.tensor(tokenizer.encode("Hello, my dog is cute")).unsqueeze(0)  # Batch size 1
>>> labels = torch.tensor([1]).unsqueeze(0)  # Batch size 1
>>> outputs = model(input_ids, labels=labels)
>>> loss, logits = outputs[:2]
"""
def __init__(self, config):
super(BertForSequenceClassification, self).__init__(config)
@@ -1027,40 +980,6 @@ class BertForSequenceClassification(BertPreTrainedModel):
self.apply(self.init_weights)
def forward(self, input_ids, token_type_ids=None, attention_mask=None, labels=None, head_mask=None):
"""
Performs the model forward pass. **Once the model has been instantiated, calling it directly runs the forward pass.**
Parameters:
`input_ids`: a ``torch.LongTensor`` of shape [batch_size, sequence_length]
with the word token indices in the vocabulary. Items in the batch should begin with the special "CLS" token. (see the tokens preprocessing logic in the scripts
`run_bert_extract_features.py`, `run_bert_classifier.py` and `run_bert_squad.py`)
`token_type_ids`: an optional ``torch.LongTensor`` of shape [batch_size, sequence_length] with the token
types indices selected in [0, 1]. Type 0 corresponds to a `sentence A` and type 1 corresponds to
a `sentence B` token (see BERT paper for more details).
`attention_mask`: an optional ``torch.LongTensor`` of shape [batch_size, sequence_length] with indices
selected in [0, 1]. It's a mask to be used if the input sequence length is smaller than the max
input sequence length in the current batch. It's the mask that we typically use for attention when
a batch has varying length sentences.
`labels`: labels for the classification output: ``torch.LongTensor`` of shape [batch_size]
with indices selected in [0, ..., num_labels].
`head_mask`: an optional ``torch.Tensor`` of shape [num_heads] or [num_layers, num_heads] with indices between 0 and 1.
It's a mask to be used to nullify some heads of the transformer. 1.0 => head is fully masked, 0.0 => head is not masked.
Returns:
If ``labels`` is not ``None``, outputs the CrossEntropy classification loss of the output with the labels.
If ``labels`` is ``None``, outputs the classification logits of shape [batch_size, num_labels].
Example::
# Already been converted into WordPiece token ids
input_ids = torch.LongTensor([[31, 51, 99], [15, 5, 0]])
input_mask = torch.LongTensor([[1, 1, 1], [1, 1, 0]])
token_type_ids = torch.LongTensor([[0, 0, 1], [0, 1, 0]])
logits = model(input_ids, token_type_ids, input_mask)
# or
logits = model.forward(input_ids, token_type_ids, input_mask)
"""
outputs = self.bert(input_ids, token_type_ids, attention_mask, head_mask=head_mask)
pooled_output = outputs[1]
@@ -1082,26 +1001,78 @@ class BertForSequenceClassification(BertPreTrainedModel):
return outputs  # (loss), logits, (hidden_states), (attentions)
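The classification/regression switch described above hinges on ``config.num_labels``. A sketch, assuming ``num_labels`` can simply be set on the loaded config before the head is built and that the regression branch behaves as documented above:

```python
import torch
from pytorch_transformers import BertConfig, BertTokenizer, BertForSequenceClassification

tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
input_ids = torch.tensor(tokenizer.encode("Hello, my dog is cute")).unsqueeze(0)

# Classification head with three classes (Cross-Entropy loss).
config = BertConfig.from_pretrained('bert-base-uncased')
config.num_labels = 3
model = BertForSequenceClassification(config)
loss, logits = model(input_ids, labels=torch.tensor([2]))[:2]        # logits: (1, 3)

# Regression head (Mean-Square loss) when num_labels == 1.
config.num_labels = 1
regressor = BertForSequenceClassification(config)
loss, scores = regressor(input_ids, labels=torch.tensor([0.5]))[:2]  # scores: (1, 1)
```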
@add_start_docstrings("""Bert Model transformer BERT model with a multiple choice classification head on top (a linear layer on top of
the pooled output and a softmax) e.g. for RocStories/SWAG tasks. """,
BERT_START_DOCSTRING)
class BertForMultipleChoice(BertPreTrainedModel):
r"""
Inputs:
**input_ids**: ``torch.LongTensor`` of shape ``(batch_size, num_choices, sequence_length)``:
Indices of input sequence tokens in the vocabulary.
The second dimension of the input (`num_choices`) indicates the number of choices to score.
To match pre-training, BERT input sequence should be formatted with [CLS] and [SEP] tokens as follows:
(a) For sequence pairs:
``tokens: [CLS] is this jack ##son ##ville ? [SEP] no it is not . [SEP]``
``token_type_ids: 0 0 0 0 0 0 0 0 1 1 1 1 1 1``
(b) For single sequences:
``tokens: [CLS] the dog is hairy . [SEP]``
``token_type_ids: 0 0 0 0 0 0 0``
Indices can be obtained using :class:`pytorch_transformers.BertTokenizer`.
See :func:`pytorch_transformers.PreTrainedTokenizer.encode` and
:func:`pytorch_transformers.PreTrainedTokenizer.convert_tokens_to_ids` for details.
**token_type_ids**: (`optional`) ``torch.LongTensor`` of shape ``(batch_size, num_choices, sequence_length)``:
Segment token indices to indicate first and second portions of the inputs.
The second dimension of the input (`num_choices`) indicates the number of choices to score.
Indices are selected in ``[0, 1]``: ``0`` corresponds to a `sentence A` token, ``1``
corresponds to a `sentence B` token
(see `BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding`_ for more details).
**attention_mask**: (`optional`) ``torch.Tensor`` of shape ``(batch_size, num_choices, sequence_length)``:
Mask to avoid performing attention on padding token indices.
The second dimension of the input (`num_choices`) indicates the number of choices to score.
Mask indices selected in ``[0, 1]``:
``1`` for tokens that are NOT MASKED, ``0`` for MASKED tokens.
**head_mask**: (`optional`) ``torch.Tensor`` of shape ``(num_heads,)`` or ``(num_layers, num_heads)``:
Mask to nullify selected heads of the self-attention modules.
Mask indices selected in ``[0, 1]``:
``1`` indicates the head is **not masked**, ``0`` indicates the head is **masked**.
**labels**: (`optional`) ``torch.LongTensor`` of shape ``(batch_size,)``:
Labels for computing the multiple choice classification loss.
Indices should be in ``[0, ..., num_choices]`` where `num_choices` is the size of the second dimension
of the input tensors. (see `input_ids` above)
Outputs: `Tuple` comprising various elements depending on the configuration (config) and inputs:
**loss**: (`optional`, returned when ``labels`` is provided) ``torch.FloatTensor`` of shape ``(1,)``:
Classification loss.
**classification_scores**: ``torch.FloatTensor`` of shape ``(batch_size, num_choices)`` where `num_choices` is the size of the second dimension
of the input tensors. (see `input_ids` above).
Classification scores (before SoftMax).
**attentions**: (`optional`, returned when ``config.output_attentions=True``)
list of ``torch.FloatTensor`` (one for each layer) of shape ``(batch_size, num_heads, sequence_length, sequence_length)``:
Attentions weights after the attention softmax, used to compute the weighted average in the self-attention heads.
**hidden_states**: (`optional`, returned when ``config.output_hidden_states=True``)
list of ``torch.FloatTensor`` (one for the output of each layer + the output of the embeddings)
of shape ``(batch_size, sequence_length, hidden_size)``:
Hidden-states of the model at the output of each layer plus the initial embedding outputs.
Examples::
>>> config = BertConfig.from_pretrained('bert-base-uncased')
>>> tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
>>>
>>> model = BertForMultipleChoice(config)
>>> choices = ["Hello, my dog is cute", "Hello, my cat is amazing"]
>>> input_ids = torch.tensor([tokenizer.encode(s) for s in choices]).unsqueeze(0) # Batch size 1, 2 choices
>>> labels = torch.tensor(1).unsqueeze(0) # Batch size 1
>>> outputs = model(input_ids, labels=labels)
>>> loss, classification_scores = outputs[:2]
"""
def __init__(self, config):
super(BertForMultipleChoice, self).__init__(config)
@@ -1113,42 +1084,6 @@ class BertForMultipleChoice(BertPreTrainedModel):
self.apply(self.init_weights)
def forward(self, input_ids, token_type_ids=None, attention_mask=None, labels=None, head_mask=None):
"""
Performs the model forward pass. **Once the model has been instantiated, calling it directly runs the forward pass.**
Parameters:
`input_ids`: a ``torch.LongTensor`` of shape [batch_size, num_choices, sequence_length]
with the word token indices in the vocabulary(see the tokens preprocessing logic in the scripts
`run_bert_extract_features.py`, `run_bert_classifier.py` and `run_bert_squad.py`)
`token_type_ids`: an optional ``torch.LongTensor`` of shape [batch_size, num_choices, sequence_length]
with the token types indices selected in [0, 1]. Type 0 corresponds to a `sentence A`
and type 1 corresponds to a `sentence B` token (see BERT paper for more details).
`attention_mask`: an optional ``torch.LongTensor`` of shape [batch_size, num_choices, sequence_length] with indices
selected in [0, 1]. It's a mask to be used if the input sequence length is smaller than the max
input sequence length in the current batch. It's the mask that we typically use for attention when
a batch has varying length sentences.
`labels`: labels for the classification output: ``torch.LongTensor`` of shape [batch_size]
with indices selected in [0, ..., num_choices].
`head_mask`: an optional ``torch.Tensor`` of shape [num_heads] or [num_layers, num_heads] with indices between 0 and 1.
It's a mask to be used to nullify some heads of the transformer. 1.0 => head is fully masked, 0.0 => head is not masked.
Returns:
If ``labels`` is not ``None``, outputs the CrossEntropy classification loss of the output with the labels.
If ``labels`` is ``None``, outputs the classification logits of shape [batch_size, num_labels].
Example::
# Already been converted into WordPiece token ids
input_ids = torch.LongTensor([[[31, 51, 99], [15, 5, 0]], [[12, 16, 42], [14, 28, 57]]])
input_mask = torch.LongTensor([[[1, 1, 1], [1, 1, 0]],[[1,1,0], [1, 0, 0]]])
token_type_ids = torch.LongTensor([[[0, 0, 1], [0, 1, 0]],[[0, 1, 1], [0, 0, 1]]])
config = BertConfig(vocab_size_or_config_json_file=32000, hidden_size=768,
num_hidden_layers=12, num_attention_heads=12, intermediate_size=3072)
model = BertForMultipleChoice(config)
logits = model(input_ids, token_type_ids, input_mask)
"""
""" Input shapes should be [bsz, num choices, seq length] """
num_choices = input_ids.shape[1]
flat_input_ids = input_ids.view(-1, input_ids.size(-1))
@@ -1171,25 +1106,39 @@ class BertForMultipleChoice(BertPreTrainedModel):
return outputs  # (loss), reshaped_logits, (hidden_states), (attentions)
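Because every choice must share one ``sequence_length``, the ``(batch_size, num_choices, sequence_length)`` inputs documented above usually require padding. A sketch, assuming the BERT ``[PAD]`` id is ``0`` and padding is done by hand:

```python
import torch
from pytorch_transformers import BertConfig, BertTokenizer, BertForMultipleChoice

tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
config = BertConfig.from_pretrained('bert-base-uncased')
model = BertForMultipleChoice(config)

choices = ["Hello, my dog is cute", "Hello, my cat is amazing"]
encoded = [tokenizer.encode(s) for s in choices]

# Pad every choice to the same length so they stack into (batch_size, num_choices, sequence_length).
max_len = max(len(ids) for ids in encoded)
input_ids = torch.tensor([ids + [0] * (max_len - len(ids)) for ids in encoded]).unsqueeze(0)
attention_mask = (input_ids != 0).long()

labels = torch.tensor([1])   # the second choice is the "correct" one in this toy example
loss, classification_scores = model(input_ids, attention_mask=attention_mask, labels=labels)[:2]
# classification_scores has shape (1, 2): one score per choice.
```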
@add_start_docstrings("""Bert Model transformer BERT model with a token classification head on top (a linear layer on top of
the hidden-states output) e.g. for Named-Entity-Recognition (NER) tasks. """,
BERT_START_DOCSTRING, BERT_INPUTS_DOCSTRING)
class BertForTokenClassification(BertPreTrainedModel):
r"""
**labels**: (`optional`) ``torch.LongTensor`` of shape ``(batch_size, sequence_length)``:
Labels for computing the token classification loss.
Indices should be in ``[0, ..., config.num_labels]``.
Outputs: `Tuple` comprising various elements depending on the configuration (config) and inputs:
**loss**: (`optional`, returned when ``labels`` is provided) ``torch.FloatTensor`` of shape ``(1,)``:
Classification loss.
**scores**: ``torch.FloatTensor`` of shape ``(batch_size, sequence_length, config.num_labels)``
Classification scores (before SoftMax).
**attentions**: (`optional`, returned when ``config.output_attentions=True``)
list of ``torch.FloatTensor`` (one for each layer) of shape ``(batch_size, num_heads, sequence_length, sequence_length)``:
Attentions weights after the attention softmax, used to compute the weighted average in the self-attention heads.
**hidden_states**: (`optional`, returned when ``config.output_hidden_states=True``)
list of ``torch.FloatTensor`` (one for the output of each layer + the output of the embeddings)
of shape ``(batch_size, sequence_length, hidden_size)``:
Hidden-states of the model at the output of each layer plus the initial embedding outputs.
Examples::
>>> config = BertConfig.from_pretrained('bert-base-uncased')
>>> tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
>>>
>>> model = BertForTokenClassification(config)
>>> input_ids = torch.tensor(tokenizer.encode("Hello, my dog is cute")).unsqueeze(0)  # Batch size 1
>>> labels = torch.tensor([1] * input_ids.size(1)).unsqueeze(0)  # Batch size 1
>>> outputs = model(input_ids, labels=labels)
>>> loss, scores = outputs[:2]
"""
def __init__(self, config):
super(BertForTokenClassification, self).__init__(config)
@@ -1202,40 +1151,6 @@ class BertForTokenClassification(BertPreTrainedModel):
self.apply(self.init_weights)
def forward(self, input_ids, token_type_ids=None, attention_mask=None, labels=None, head_mask=None):
"""
Performs the model forward pass. **Once the model has been instantiated, calling it directly runs the forward pass.**
Parameters:
`input_ids`: a ``torch.LongTensor`` of shape [batch_size, sequence_length]
with the word token indices in the vocabulary(see the tokens pre-processing logic in the scripts
`run_bert_extract_features.py`, `run_bert_classifier.py` and `run_bert_squad.py`)
`token_type_ids`: an optional ``torch.LongTensor`` of shape [batch_size, sequence_length] with the token
types indices selected in [0, 1]. Type 0 corresponds to a `sentence A` and type 1 corresponds to
a `sentence B` token (see BERT paper for more details).
`attention_mask`: an optional ``torch.LongTensor`` of shape [batch_size, sequence_length] with indices
selected in [0, 1]. It's a mask to be used if the input sequence length is smaller than the max
input sequence length in the current batch. It's the mask that we typically use for attention when
a batch has varying length sentences.
`labels`: labels for the classification output: ``torch.LongTensor`` of shape [batch_size, sequence_length]
with indices selected in [0, ..., num_labels].
`head_mask`: an optional ``torch.Tensor`` of shape [num_heads] or [num_layers, num_heads] with indices between 0 and 1.
It's a mask to be used to nullify some heads of the transformer. 1.0 => head is fully masked, 0.0 => head is not masked.
Returns:
If ``labels`` is not ``None``, outputs the CrossEntropy classification loss of the output with the labels.
If ``labels`` is ``None``, outputs the classification logits of shape [batch_size, sequence_length, num_labels].
Example::
# Already been converted into WordPiece token ids
input_ids = torch.LongTensor([[31, 51, 99], [15, 5, 0]])
input_mask = torch.LongTensor([[1, 1, 1], [1, 1, 0]])
token_type_ids = torch.LongTensor([[0, 0, 1], [0, 1, 0]])
logits = model(input_ids, token_type_ids, input_mask)
# or
logits = model.forward(input_ids, token_type_ids, input_mask)
"""
outputs = self.bert(input_ids, token_type_ids, attention_mask, head_mask=head_mask)
sequence_output = outputs[0]
@@ -1255,25 +1170,50 @@ class BertForTokenClassification(BertPreTrainedModel):
loss = loss_fct(logits.view(-1, self.num_labels), labels.view(-1))
outputs = (loss,) + outputs
return outputs  # (loss), scores, (hidden_states), (attentions)
@add_start_docstrings("""Bert Model transformer BERT model with a span classification head on top for extractive question-answering tasks like SQuAD (a linear layers on top of
the hidden-states output to compute `span start logits` and `span end logits`). """,
BERT_START_DOCSTRING, BERT_INPUTS_DOCSTRING)
class BertForQuestionAnswering(BertPreTrainedModel):
r"""
**start_positions**: (`optional`) ``torch.LongTensor`` of shape ``(batch_size,)``:
Position (index) of the start of the labelled span for computing the token classification loss.
Positions are clamped to the length of the sequence (`sequence_length`).
Positions outside of the sequence are not taken into account for computing the loss.
**end_positions**: (`optional`) ``torch.LongTensor`` of shape ``(batch_size,)``:
Position (index) of the end of the labelled span for computing the token classification loss.
Positions are clamped to the length of the sequence (`sequence_length`).
Positions outside of the sequence are not taken into account for computing the loss.
Outputs: `Tuple` comprising various elements depending on the configuration (config) and inputs:
**loss**: (`optional`, returned when ``start_positions`` and ``end_positions`` are provided) ``torch.FloatTensor`` of shape ``(1,)``:
Total span extraction loss is the sum of a Cross-Entropy for the start and end positions.
**start_scores**: ``torch.FloatTensor`` of shape ``(batch_size, sequence_length,)``
Span-start scores (before SoftMax).
**end_scores**: ``torch.FloatTensor`` of shape ``(batch_size, sequence_length,)``
Span-end scores (before SoftMax).
**attentions**: (`optional`, returned when ``config.output_attentions=True``)
list of ``torch.FloatTensor`` (one for each layer) of shape ``(batch_size, num_heads, sequence_length, sequence_length)``:
Attentions weights after the attention softmax, used to compute the weighted average in the self-attention heads.
**hidden_states**: (`optional`, returned when ``config.output_hidden_states=True``)
list of ``torch.FloatTensor`` (one for the output of each layer + the output of the embeddings)
of shape ``(batch_size, sequence_length, hidden_size)``:
Hidden-states of the model at the output of each layer plus the initial embedding outputs.
Examples::
>>> config = BertConfig.from_pretrained('bert-base-uncased')
>>> tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
>>>
>>> model = BertForQuestionAnswering(config)
>>> input_ids = torch.tensor(tokenizer.encode("Hello, my dog is cute")).unsqueeze(0)  # Batch size 1
>>> start_positions = torch.tensor([1])
>>> end_positions = torch.tensor([3])
>>> outputs = model(input_ids, start_positions=start_positions, end_positions=end_positions)
>>> loss, start_scores, end_scores = outputs[:3]
"""
def __init__(self, config):
super(BertForQuestionAnswering, self).__init__(config)
@@ -1286,44 +1226,6 @@ class BertForQuestionAnswering(BertPreTrainedModel):
def forward(self, input_ids, token_type_ids=None, attention_mask=None, start_positions=None,
end_positions=None, head_mask=None):
"""
Performs the model forward pass. **Once the model has been instantiated, calling it directly runs the forward pass.**
Parameters:
`input_ids`: a ``torch.LongTensor`` of shape [batch_size, sequence_length]
with the word token indices in the vocabulary(see the tokens preprocessing logic in the scripts
`run_bert_extract_features.py`, `run_bert_classifier.py` and `run_bert_squad.py`)
`token_type_ids`: an optional ``torch.LongTensor`` of shape [batch_size, sequence_length] with the token
types indices selected in [0, 1]. Type 0 corresponds to a `sentence A` and type 1 corresponds to
a `sentence B` token (see BERT paper for more details).
`attention_mask`: an optional ``torch.LongTensor`` of shape [batch_size, sequence_length] with indices
selected in [0, 1]. It's a mask to be used if the input sequence length is smaller than the max
input sequence length in the current batch. It's the mask that we typically use for attention when
a batch has varying length sentences.
`start_positions`: position of the first token for the labeled span: ``torch.LongTensor`` of shape [batch_size].
Positions are clamped to the length of the sequence and position outside of the sequence are not taken
into account for computing the loss.
`end_positions`: position of the last token for the labeled span: ``torch.LongTensor`` of shape [batch_size].
Positions are clamped to the length of the sequence and position outside of the sequence are not taken
into account for computing the loss.
`head_mask`: an optional ``torch.Tensor`` of shape [num_heads] or [num_layers, num_heads] with indices between 0 and 1.
It's a mask to be used to nullify some heads of the transformer. 1.0 => head is fully masked, 0.0 => head is not masked.
Returns:
If ``start_positions`` and ``end_positions`` are not ``None``, outputs the total_loss which is the sum of the
CrossEntropy loss for the start and end token positions.
If ``start_positions`` or ``end_positions`` is ``None``, outputs a tuple of start_logits, end_logits which are the
logits respectively for the start and end position tokens of shape [batch_size, sequence_length].
Example::
# Already been converted into WordPiece token ids
input_ids = torch.LongTensor([[31, 51, 99], [15, 5, 0]])
input_mask = torch.LongTensor([[1, 1, 1], [1, 1, 0]])
token_type_ids = torch.LongTensor([[0, 0, 1], [0, 1, 0]])
start_logits, end_logits = model(input_ids, token_type_ids, input_mask)
"""
outputs = self.bert(input_ids, token_type_ids, attention_mask, head_mask=head_mask)
sequence_output = outputs[0]
...
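A sketch of decoding an answer span from ``start_scores``/``end_scores`` with a plain ``argmax`` (the QA head of a vanilla `bert-base-uncased` checkpoint is untrained, so a real setup would load weights fine-tuned on SQuAD):

```python
import torch
from pytorch_transformers import BertTokenizer, BertForQuestionAnswering

tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
model = BertForQuestionAnswering.from_pretrained('bert-base-uncased')
model.eval()

question, context = "who was jim henson ?", "jim henson was a nice puppet"
q_tokens, c_tokens = tokenizer.tokenize(question), tokenizer.tokenize(context)
tokens = ['[CLS]'] + q_tokens + ['[SEP]'] + c_tokens + ['[SEP]']
token_type_ids = torch.tensor([[0] * (len(q_tokens) + 2) + [1] * (len(c_tokens) + 1)])
input_ids = torch.tensor([tokenizer.convert_tokens_to_ids(tokens)])

with torch.no_grad():
    start_scores, end_scores = model(input_ids, token_type_ids=token_type_ids)[:2]

# Greedy span: highest start index and highest end index (no best-span search).
start, end = start_scores.argmax().item(), end_scores.argmax().item()
print(tokens[start:end + 1])
```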
@@ -36,6 +36,13 @@ WEIGHTS_NAME = "pytorch_model.bin"
TF_WEIGHTS_NAME = 'model.ckpt'
def add_start_docstrings(*docstr):
def docstring_decorator(fn):
fn.__doc__ = ''.join(docstr) + fn.__doc__
return fn
return docstring_decorator
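The helper simply prepends the shared blocks to whatever docstring the decorated object already defines (so the decorated class or function must have one). A toy sketch of the composition:

```python
# Hypothetical usage of add_start_docstrings; SHARED and ToyModel are illustrative names only.
SHARED = r"""Shared introduction. """

@add_start_docstrings("Toy model. ", SHARED)
class ToyModel(object):
    r"""Class-specific outputs section."""

print(ToyModel.__doc__)
# -> "Toy model. Shared introduction. Class-specific outputs section."
```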
class PretrainedConfig(object):
""" An abstract class to handle downloading a model's pretrained config.
"""
...