BIG Reorganize examples (#4213)

* Created using Colaboratory
* [examples] reorganize files
* remove run_tpu_glue.py as superseded by TPU support in Trainer
* Bugfix: int, not tuple
* move files around

Commit 0ae96ff8, authored May 07, 2020 by Julien Chaumond; committed by GitHub May 07, 2020
20 changed files
--- a/examples/run_multiple_choice.py
+++ b/examples/run_multiple_choice.py
--- a/examples/utils_multiple_choice.py
+++ b/examples/utils_multiple_choice.py
--- a/examples/question-answering/README.md
+++ b/examples/question-answering/README.md
+
+
+## SQuAD
+
+Based on the script [`run_squad.py`](https://github.com/huggingface/transformers/blob/master/examples/run_squad.py).
+
+#### Fine-tuning BERT on SQuAD1.1
+
+This example code fine-tunes BERT on the SQuAD1.1 dataset. It runs in 24 min (with BERT-base) or 68 min (with BERT-large)
+on a single Tesla V100 16GB. The data for SQuAD can be downloaded from the following links and should be saved in a
+`$SQUAD_DIR` directory.
+
+* [train-v1.1.json](https://rajpurkar.github.io/SQuAD-explorer/dataset/train-v1.1.json)
+* [dev-v1.1.json](https://rajpurkar.github.io/SQuAD-explorer/dataset/dev-v1.1.json)
+* [evaluate-v1.1.py](https://github.com/allenai/bi-att-flow/blob/master/squad/evaluate-v1.1.py)
+
+And for SQuAD2.0, you need to download:
+
+- [train-v2.0.json](https://rajpurkar.github.io/SQuAD-explorer/dataset/train-v2.0.json)
+- [dev-v2.0.json](https://rajpurkar.github.io/SQuAD-explorer/dataset/dev-v2.0.json)
+- [evaluate-v2.0.py](https://worksheets.codalab.org/rest/bundles/0x6b567e1cf2e041ec80d7098f031c5c9e/contents/blob/)
+
+```bash
+export SQUAD_DIR=/path/to/SQUAD
+
+python run_squad.py \
+  --model_type bert \
+  --model_name_or_path bert-base-uncased \
+  --do_train \
+  --do_eval \
+  --train_file $SQUAD_DIR/train-v1.1.json \
+  --predict_file $SQUAD_DIR/dev-v1.1.json \
+  --per_gpu_train_batch_size 12 \
+  --learning_rate 3e-5 \
+  --num_train_epochs 2.0 \
+  --max_seq_length 384 \
+  --doc_stride 128 \
+  --output_dir /tmp/debug_squad/
+```
+
+Training with the previously defined hyper-parameters yields the following results:
+
+```bash
+f1 = 88.52
+exact_match = 81.22
+```
+
+#### Distributed training
+
+
+Here is an example using distributed training on 8 V100 GPUs and the BERT Whole Word Masking uncased model to reach an F1 > 93 on SQuAD1.1:
+
+```bash
+python -m torch.distributed.launch --nproc_per_node=8 ./examples/run_squad.py \
+    --model_type bert \
+    --model_name_or_path bert-large-uncased-whole-word-masking \
+    --do_train \
+    --do_eval \
+    --train_file $SQUAD_DIR/train-v1.1.json \
+    --predict_file $SQUAD_DIR/dev-v1.1.json \
+    --learning_rate 3e-5 \
+    --num_train_epochs 2 \
+    --max_seq_length 384 \
+    --doc_stride 128 \
+    --output_dir ./examples/models/wwm_uncased_finetuned_squad/ \
+    --per_gpu_eval_batch_size=3 \
+    --per_gpu_train_batch_size=3
+```
+
+Training with the previously defined hyper-parameters yields the following results:
+
+```bash
+f1 = 93.15
+exact_match = 86.91
+```
+
+This fine-tuned model is available as a checkpoint under the reference
+`bert-large-uncased-whole-word-masking-finetuned-squad`.
+
+#### Fine-tuning XLNet on SQuAD
+
+This example code fine-tunes XLNet on both the SQuAD1.1 and SQuAD2.0 datasets. See above for how to download the SQuAD data.
+
+##### Command for SQuAD1.1:
+
+```bash
+export SQUAD_DIR=/path/to/SQUAD
+
+python run_squad.py \
+    --model_type xlnet \
+    --model_name_or_path xlnet-large-cased \
+    --do_train \
+    --do_eval \
+    --train_file $SQUAD_DIR/train-v1.1.json \
+    --predict_file $SQUAD_DIR/dev-v1.1.json \
+    --learning_rate 3e-5 \
+    --num_train_epochs 2 \
+    --max_seq_length 384 \
+    --doc_stride 128 \
+    --output_dir ./wwm_cased_finetuned_squad/ \
+    --per_gpu_eval_batch_size=4  \
+    --per_gpu_train_batch_size=4   \
+    --save_steps 5000
+```
+
+##### Command for SQuAD2.0:
+
+```bash
+export SQUAD_DIR=/path/to/SQUAD
+
+python run_squad.py \
+    --model_type xlnet \
+    --model_name_or_path xlnet-large-cased \
+    --do_train \
+    --do_eval \
+    --version_2_with_negative \
+    --train_file $SQUAD_DIR/train-v2.0.json \
+    --predict_file $SQUAD_DIR/dev-v2.0.json \
+    --learning_rate 3e-5 \
+    --num_train_epochs 4 \
+    --max_seq_length 384 \
+    --doc_stride 128 \
+    --output_dir ./wwm_cased_finetuned_squad/ \
+    --per_gpu_eval_batch_size=2  \
+    --per_gpu_train_batch_size=2   \
+    --save_steps 5000
+```
+
+A larger batch size may improve performance at the cost of more memory.
+
+##### Results for SQuAD1.1 with the previously defined hyper-parameters:
+
+```python
+{
+"exact": 85.45884578997162,
+"f1": 92.5974600601065,
+"total": 10570,
+"HasAns_exact": 85.45884578997162,
+"HasAns_f1": 92.59746006010651,
+"HasAns_total": 10570
+}
+```
+
+##### Results for SQuAD2.0 with the previously defined hyper-parameters:
+
+```python
+{
+"exact": 80.4177545691906,
+"f1": 84.07154997729623,
+"total": 11873,
+"HasAns_exact": 76.73751686909581,
+"HasAns_f1": 84.05558584352873,
+"HasAns_total": 5928,
+"NoAns_exact": 84.0874684608915,
+"NoAns_f1": 84.0874684608915,
+"NoAns_total": 5945
+}
+```
+
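Note on the `f1` and `exact_match` figures quoted in the README above: they follow the usual SQuAD conventions, where answers are normalized before comparison and F1 is computed over overlapping tokens. The block below is a minimal sketch of those two metrics for a single prediction, assuming the standard SQuAD v1.1 normalization (lowercase, strip punctuation, drop English articles, collapse whitespace). The function names are illustrative and are not taken from `run_squad.py`; the reported numbers are additionally averaged over the dev set, take the best score across all reference answers, and are expressed as percentages.

```python
import re
import string
from collections import Counter


def normalize_answer(text):
    """Lowercase, strip punctuation, drop articles, collapse whitespace."""
    text = text.lower()
    punctuation = set(string.punctuation)
    text = "".join(ch for ch in text if ch not in punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())


def exact_match(prediction, ground_truth):
    """Return 1.0 if the normalized strings are identical, else 0.0."""
    return float(normalize_answer(prediction) == normalize_answer(ground_truth))


def f1_score(prediction, ground_truth):
    """Token-level F1 between a normalized prediction and reference."""
    pred_tokens = normalize_answer(prediction).split()
    gold_tokens = normalize_answer(ground_truth).split()
    # Multiset intersection counts each shared token at most as often
    # as it appears in both answers.
    num_same = sum((Counter(pred_tokens) & Counter(gold_tokens)).values())
    if num_same == 0:
        return 0.0
    precision = num_same / len(pred_tokens)
    recall = num_same / len(gold_tokens)
    return 2 * precision * recall / (precision + recall)
```

For example, `exact_match("The Eiffel Tower", "eiffel tower!")` counts as a match because the article and punctuation are removed first, while `f1_score("in Paris France", "Paris")` gives partial credit for the one shared token.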
--- a/examples/run_squad.py
+++ b/examples/run_squad.py
--- a/examples/requirements.txt
+++ b/examples/requirements.txt
-tensorboardX
 tensorboard
 scikit-learn
 seqeval

--- a/examples/run_tpu_glue.py
+++ b/examples/run_tpu_glue.py
-# coding=utf-8
-# Copyright 2019 The Google AI Language Team Authors and The HuggingFace Inc. team.
-# Copyright (c) 2019, NVIDIA CORPORATION.  All rights reserved.
-#
-# Licensed under the Apache License, Version 2.0 (the "License");
-# you may not use this file except in compliance with the License.
-# You may obtain a copy of the License at
-#
-#     http://www.apache.org/licenses/LICENSE-2.0
-#
-# Unless required by applicable law or agreed to in writing, software
-# distributed under the License is distributed on an "AS IS" BASIS,
-# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
-# See the License for the specific language governing permissions and
-# limitations under the License.
-""" Finetuning the library models for sequence classification on GLUE (Bert, DistilBert, XLNet, RoBERTa)."""
-
-from __future__ import absolute_import, division, print_function
-
-import argparse
-import glob
-import logging
-import os
-import random
-
-import numpy as np
-import torch
-import torch_xla.core.xla_model as xm
-import torch_xla.debug.metrics as met
-import torch_xla.distributed.parallel_loader as pl
-import torch_xla.distributed.xla_multiprocessing as xmp
-from torch.utils.data import DataLoader, RandomSampler, TensorDataset
-from torch.utils.data.distributed import DistributedSampler
-from tqdm import tqdm, trange
-
-from transformers import (
-    WEIGHTS_NAME,
-    AdamW,
-    BertConfig,
-    BertForSequenceClassification,
-    BertTokenizer,
-    DistilBertConfig,
-    DistilBertForSequenceClassification,
-    DistilBertTokenizer,
-    RobertaConfig,
-    RobertaForSequenceClassification,
-    RobertaTokenizer,
-    XLMConfig,
-    XLMForSequenceClassification,
-    XLMTokenizer,
-    XLNetConfig,
-    XLNetForSequenceClassification,
-    XLNetTokenizer,
-    get_linear_schedule_with_warmup,
-)
-from transformers import glue_compute_metrics as compute_metrics
-from transformers import glue_convert_examples_to_features as convert_examples_to_features
-from transformers import glue_output_modes as output_modes
-from transformers import glue_processors as processors
-
-
-try:
-    # Only tensorboardX supports writing directly to gs://
-    from tensorboardX import SummaryWriter
-except ImportError:
-    from torch.utils.tensorboard import SummaryWriter
-
-
-logger = logging.getLogger(__name__)
-
-ALL_MODELS = sum(
-    (
-        tuple(conf.pretrained_config_archive_map.keys())
-        for conf in (BertConfig, XLNetConfig, XLMConfig, RobertaConfig, DistilBertConfig)
-    ),
-    (),
-)
-
-MODEL_CLASSES = {
-    "bert": (BertConfig, BertForSequenceClassification, BertTokenizer),
-    "xlnet": (XLNetConfig, XLNetForSequenceClassification, XLNetTokenizer),
-    "xlm": (XLMConfig, XLMForSequenceClassification, XLMTokenizer),
-    "roberta": (RobertaConfig, RobertaForSequenceClassification, RobertaTokenizer),
-    "distilbert": (DistilBertConfig, DistilBertForSequenceClassification, DistilBertTokenizer),
-}
-
-
-def set_seed(seed):
-    random.seed(seed)
-    np.random.seed(seed)
-    torch.manual_seed(seed)
-
-
-def get_sampler(dataset):
-    if xm.xrt_world_size() <= 1:
-        return RandomSampler(dataset)
-    return DistributedSampler(dataset, num_replicas=xm.xrt_world_size(), rank=xm.get_ordinal())
-
-
-def train(args, train_dataset, model, tokenizer, disable_logging=False):
-    """ Train the model """
-    if xm.is_master_ordinal():
-        # Only master writes to Tensorboard
-        tb_writer = SummaryWriter(args.tensorboard_logdir)
-
-    train_sampler = get_sampler(train_dataset)
-    dataloader = DataLoader(train_dataset, sampler=train_sampler, batch_size=args.train_batch_size)
-
-    if args.max_steps > 0:
-        t_total = args.max_steps
-        args.num_train_epochs = args.max_steps // (len(dataloader) // args.gradient_accumulation_steps) + 1
-    else:
-        t_total = len(dataloader) // args.gradient_accumulation_steps * args.num_train_epochs
-
-    # Prepare optimizer and schedule (linear warmup and decay)
-    no_decay = ["bias", "LayerNorm.weight"]
-    optimizer_grouped_parameters = [
-        {
-            "params": [p for n, p in model.named_parameters() if not any(nd in n for nd in no_decay)],
-            "weight_decay": args.weight_decay,
-        },
-        {"params": [p for n, p in model.named_parameters() if any(nd in n for nd in no_decay)], "weight_decay": 0.0},
-    ]
-    optimizer = AdamW(optimizer_grouped_parameters, lr=args.learning_rate, eps=args.adam_epsilon)
-    scheduler = get_linear_schedule_with_warmup(
-        optimizer, num_warmup_steps=args.warmup_steps, num_training_steps=t_total,
-    )
-
-    # Train!
-    logger.info("***** Running training *****")
-    logger.info("  Num examples = %d", len(dataloader) * args.train_batch_size)
-    logger.info("  Num Epochs = %d", args.num_train_epochs)
-    logger.info("  Instantaneous batch size per TPU core = %d", args.train_batch_size)
-    logger.info(
-        "  Total train batch size (w. parallel, distributed & accumulation) = %d",
-        (args.train_batch_size * args.gradient_accumulation_steps * xm.xrt_world_size()),
-    )
-    logger.info("  Gradient Accumulation steps = %d", args.gradient_accumulation_steps)
-    logger.info("  Total optimization steps = %d", t_total)
-
-    global_step = 0
-    loss = None
-    model.zero_grad()
-    train_iterator = trange(int(args.num_train_epochs), desc="Epoch", disable=disable_logging)
-    set_seed(args.seed)  # Added here for reproductibility (even between python 2 and 3)
-    for epoch in train_iterator:
-        # tpu-comment: Get TPU parallel loader which sends data to TPU in background.
-        train_dataloader = pl.ParallelLoader(dataloader, [args.device]).per_device_loader(args.device)
-        epoch_iterator = tqdm(train_dataloader, desc="Iteration", total=len(dataloader), disable=disable_logging)
-        for step, batch in enumerate(epoch_iterator):
-
-            # Save model checkpoint.
-            if args.save_steps > 0 and global_step % args.save_steps == 0:
-                output_dir = os.path.join(args.output_dir, "checkpoint-{}".format(global_step))
-                logger.info("Saving model checkpoint to %s", output_dir)
-
-                if xm.is_master_ordinal():
-                    if not os.path.exists(output_dir):
-                        os.makedirs(output_dir)
-                    torch.save(args, os.path.join(output_dir, "training_args.bin"))
-
-                # Barrier to wait for saving checkpoint.
-                xm.rendezvous("mid_training_checkpoint")
-                # model.save_pretrained needs to be called by all ordinals
-                model.save_pretrained(output_dir)
-
-            model.train()
-            inputs = {"input_ids": batch[0], "attention_mask": batch[1], "labels": batch[3]}
-            if args.model_type != "distilbert":
-                # XLM, DistilBERT and RoBERTa don't use segment_ids
-                inputs["token_type_ids"] = batch[2] if args.model_type in ["bert", "xlnet"] else None
-            outputs = model(**inputs)
-            loss = outputs[0]  # model outputs are always tuple in transformers (see doc)
-
-            if args.gradient_accumulation_steps > 1:
-                loss = loss / args.gradient_accumulation_steps
-
-            loss.backward()
-
-            if (step + 1) % args.gradient_accumulation_steps == 0:
-                torch.nn.utils.clip_grad_norm_(model.parameters(), args.max_grad_norm)
-
-                xm.optimizer_step(optimizer)
-                scheduler.step()  # Update learning rate schedule
-                model.zero_grad()
-                global_step += 1
-
-                if args.logging_steps > 0 and global_step % args.logging_steps == 0:
-                    # Log metrics.
-                    results = {}
-                    if args.evaluate_during_training:
-                        results = evaluate(args, model, tokenizer, disable_logging=disable_logging)
-                    loss_scalar = loss.item()
-                    logger.info(
-                        "global_step: {global_step}, lr: {lr:.6f}, loss: {loss:.3f}".format(
-                            global_step=global_step, lr=scheduler.get_lr()[0], loss=loss_scalar
-                        )
-                    )
-                    if xm.is_master_ordinal():
-                        # tpu-comment: All values must be in CPU and not on TPU device
-                        for key, value in results.items():
-                            tb_writer.add_scalar("eval_{}".format(key), value, global_step)
-                        tb_writer.add_scalar("lr", scheduler.get_lr()[0], global_step)
-                        tb_writer.add_scalar("loss", loss_scalar, global_step)
-
-            if args.max_steps > 0 and global_step > args.max_steps:
-                epoch_iterator.close()
-                break
-        if args.metrics_debug:
-            # tpu-comment: Logging debug metrics for PyTorch/XLA (compile, execute times, ops, etc.)
-            xm.master_print(met.metrics_report())
-        if args.max_steps > 0 and global_step > args.max_steps:
-            train_iterator.close()
-            break
-
-    if xm.is_master_ordinal():
-        tb_writer.close()
-    return global_step, loss.item()
-
-
-def evaluate(args, model, tokenizer, prefix="", disable_logging=False):
-    """Evaluate the model"""
-    if xm.is_master_ordinal():
-        # Only master writes to Tensorboard
-        tb_writer = SummaryWriter(args.tensorboard_logdir)
-
-    # Loop to handle MNLI double evaluation (matched, mis-matched)
-    eval_task_names = ("mnli", "mnli-mm") if args.task_name == "mnli" else (args.task_name,)
-    eval_outputs_dirs = (args.output_dir, args.output_dir + "-MM") if args.task_name == "mnli" else (args.output_dir,)
-
-    results = {}
-    for eval_task, eval_output_dir in zip(eval_task_names, eval_outputs_dirs):
-        eval_dataset = load_and_cache_examples(args, eval_task, tokenizer, evaluate=True)
-        eval_sampler = get_sampler(eval_dataset)
-
-        if not os.path.exists(eval_output_dir):
-            os.makedirs(eval_output_dir)
-
-        dataloader = DataLoader(eval_dataset, sampler=eval_sampler, batch_size=args.eval_batch_size, shuffle=False)
-        eval_dataloader = pl.ParallelLoader(dataloader, [args.device]).per_device_loader(args.device)
-
-        # Eval!
-        logger.info("***** Running evaluation {} *****".format(prefix))
-        logger.info("  Num examples = %d", len(dataloader) * args.eval_batch_size)
-        logger.info("  Batch size = %d", args.eval_batch_size)
-        eval_loss = 0.0
-        nb_eval_steps = 0
-        preds = None
-        out_label_ids = None
-        for batch in tqdm(eval_dataloader, desc="Evaluating", disable=disable_logging):
-            model.eval()
-
-            with torch.no_grad():
-                inputs = {"input_ids": batch[0], "attention_mask": batch[1], "labels": batch[3]}
-                if args.model_type != "distilbert":
-                    # XLM, DistilBERT and RoBERTa don't use segment_ids
-                    inputs["token_type_ids"] = batch[2] if args.model_type in ["bert", "xlnet"] else None
-                outputs = model(**inputs)
-                batch_eval_loss, logits = outputs[:2]
-
-                eval_loss += batch_eval_loss
-            nb_eval_steps += 1
-            if preds is None:
-                preds = logits.detach().cpu().numpy()
-                out_label_ids = inputs["labels"].detach().cpu().numpy()
-            else:
-                preds = np.append(preds, logits.detach().cpu().numpy(), axis=0)
-                out_label_ids = np.append(out_label_ids, inputs["labels"].detach().cpu().numpy(), axis=0)
-
-        # tpu-comment: Get all predictions and labels from all worker shards of eval dataset
-        preds = xm.mesh_reduce("eval_preds", preds, np.concatenate)
-        out_label_ids = xm.mesh_reduce("eval_out_label_ids", out_label_ids, np.concatenate)
-
-        eval_loss = eval_loss / nb_eval_steps
-        if args.output_mode == "classification":
-            preds = np.argmax(preds, axis=1)
-        elif args.output_mode == "regression":
-            preds = np.squeeze(preds)
-        result = compute_metrics(eval_task, preds, out_label_ids)
-        results.update(result)
-        results["eval_loss"] = eval_loss.item()
-
-        output_eval_file = os.path.join(eval_output_dir, prefix, "eval_results.txt")
-        if xm.is_master_ordinal():
-            with open(output_eval_file, "w") as writer:
-                logger.info("***** Eval results {} *****".format(prefix))
-                for key in sorted(results.keys()):
-                    logger.info("  %s = %s", key, str(results[key]))
-                    writer.write("%s = %s\n" % (key, str(results[key])))
-                    tb_writer.add_scalar(f"{eval_task}/{key}", results[key])
-
-    if args.metrics_debug:
-        # tpu-comment: Logging debug metrics for PyTorch/XLA (compile, execute times, ops, etc.)
-        xm.master_print(met.metrics_report())
-
-    if xm.is_master_ordinal():
-        tb_writer.close()
-
-    return results
-
-
-def load_and_cache_examples(args, task, tokenizer, evaluate=False):
-    if not xm.is_master_ordinal():
-        xm.rendezvous("load_and_cache_examples")
-
-    processor = processors[task]()
-    output_mode = output_modes[task]
-    cached_features_file = os.path.join(
-        args.cache_dir,
-        "cached_{}_{}_{}_{}".format(
-            "dev" if evaluate else "train",
-            list(filter(None, args.model_name_or_path.split("/"))).pop(),
-            str(args.max_seq_length),
-            str(task),
-        ),
-    )
-
-    # Load data features from cache or dataset file
-    if os.path.exists(cached_features_file) and not args.overwrite_cache:
-        logger.info("Loading features from cached file %s", cached_features_file)
-        features = torch.load(cached_features_file)
-    else:
-        logger.info("Creating features from dataset file at %s", args.data_dir)
-        label_list = processor.get_labels()
-        if task in ["mnli", "mnli-mm"] and args.model_type in ["roberta"]:
-            # HACK(label indices are swapped in RoBERTa pretrained model)
-            label_list[1], label_list[2] = label_list[2], label_list[1]
-        examples = (
-            processor.get_dev_examples(args.data_dir) if evaluate else processor.get_train_examples(args.data_dir)
-        )
-        features = convert_examples_to_features(
-            examples, tokenizer, max_length=args.max_seq_length, label_list=label_list, output_mode=output_mode,
-        )
-        logger.info("Saving features into cached file %s", cached_features_file)
-        torch.save(features, cached_features_file)
-
-    if xm.is_master_ordinal():
-        xm.rendezvous("load_and_cache_examples")
-
-    # Convert to Tensors and build dataset
-    all_input_ids = torch.tensor([f.input_ids for f in features], dtype=torch.long)
-    all_attention_mask = torch.tensor([f.attention_mask for f in features], dtype=torch.long)
-    all_token_type_ids = torch.tensor([f.token_type_ids for f in features], dtype=torch.long)
-    if output_mode == "classification":
-        all_labels = torch.tensor([f.label for f in features], dtype=torch.long)
-    elif output_mode == "regression":
-        all_labels = torch.tensor([f.label for f in features], dtype=torch.float)
-
-    dataset = TensorDataset(all_input_ids, all_attention_mask, all_token_type_ids, all_labels)
-    return dataset
-
-
-def main(args):
-    if (
-        os.path.exists(args.output_dir)
-        and os.listdir(args.output_dir)
-        and args.do_train
-        and not args.overwrite_output_dir
-    ):
-        raise ValueError(
-            (
-                "Output directory ({}) already exists and is not empty." " Use --overwrite_output_dir to overcome."
-            ).format(args.output_dir)
-        )
-
-    # tpu-comment: Get TPU/XLA Device
-    args.device = xm.xla_device()
-
-    # Setup logging
-    logging.basicConfig(
-        format="[xla:{}] %(asctime)s - %(levelname)s - %(name)s -   %(message)s".format(xm.get_ordinal()),
-        datefmt="%m/%d/%Y %H:%M:%S",
-        level=logging.INFO,
-    )
-    disable_logging = False
-    if not xm.is_master_ordinal() and args.only_log_master:
-        # Disable all non-master loggers below CRITICAL.
-        logging.disable(logging.CRITICAL)
-        disable_logging = True
-    logger.warning("Process rank: %s, device: %s, num_cores: %s", xm.get_ordinal(), args.device, args.num_cores)
-
-    # Set seed to have same initialization
-    set_seed(args.seed)
-
-    # Prepare GLUE task
-    args.task_name = args.task_name.lower()
-    if args.task_name not in processors:
-        raise ValueError("Task not found: %s" % (args.task_name))
-    processor = processors[args.task_name]()
-    args.output_mode = output_modes[args.task_name]
-    label_list = processor.get_labels()
-    num_labels = len(label_list)
-
-    if not xm.is_master_ordinal():
-        xm.rendezvous(
-            "download_only_once"
-        )  # Make sure only the first process in distributed training will download model & vocab
-
-    # Load pretrained model and tokenizer
-    args.model_type = args.model_type.lower()
-    config_class, model_class, tokenizer_class = MODEL_CLASSES[args.model_type]
-    config = config_class.from_pretrained(
-        args.config_name if args.config_name else args.model_name_or_path,
-        num_labels=num_labels,
-        finetuning_task=args.task_name,
-        cache_dir=args.cache_dir if args.cache_dir else None,
-        xla_device=True,
-    )
-    tokenizer = tokenizer_class.from_pretrained(
-        args.tokenizer_name if args.tokenizer_name else args.model_name_or_path,
-        do_lower_case=args.do_lower_case,
-        cache_dir=args.cache_dir if args.cache_dir else None,
-    )
-    model = model_class.from_pretrained(
-        args.model_name_or_path,
-        from_tf=bool(".ckpt" in args.model_name_or_path),
-        config=config,
-        cache_dir=args.cache_dir if args.cache_dir else None,
-    )
-
-    if xm.is_master_ordinal():
-        xm.rendezvous("download_only_once")
-
-    # Send model to TPU/XLA device.
-    model.to(args.device)
-
-    logger.info("Training/evaluation parameters %s", args)
-
-    if args.do_train:
-        # Train the model.
-        train_dataset = load_and_cache_examples(args, args.task_name, tokenizer, evaluate=False)
-        global_step, tr_loss = train(args, train_dataset, model, tokenizer, disable_logging=disable_logging)
-        logger.info(" global_step = %s, average loss = %s", global_step, tr_loss)
-
-        if xm.is_master_ordinal():
-            # Save trained model.
-            # Saving best-practices: if you use defaults names for the model, you can reload it using from_pretrained()
-
-            # Create output directory if needed
-            if not os.path.exists(args.output_dir):
-                os.makedirs(args.output_dir)
-
-            logger.info("Saving model checkpoint to %s", args.output_dir)
-            # Save a trained model, configuration and tokenizer using `save_pretrained()`.
-            # They can then be reloaded using `from_pretrained()`
-            tokenizer.save_pretrained(args.output_dir)
-            # Good practice: save your training arguments together with the trained.
-            torch.save(args, os.path.join(args.output_dir, "training_args.bin"))
-
-        xm.rendezvous("post_training_checkpoint")
-        # model.save_pretrained needs to be called by all ordinals
-        model.save_pretrained(args.output_dir)
-
-        # Load a trained model and vocabulary that you have fine-tuned
-        model = model_class.from_pretrained(args.output_dir)
-        tokenizer = tokenizer_class.from_pretrained(args.output_dir)
-        model.to(args.device)
-
-    # Evaluation
-    results = {}
-    if args.do_eval:
-        tokenizer = tokenizer_class.from_pretrained(args.output_dir, do_lower_case=args.do_lower_case)
-        checkpoints = [args.output_dir]
-        if args.eval_all_checkpoints:
-            checkpoints = list(
-                os.path.dirname(c) for c in sorted(glob.glob(args.output_dir + "/**/" + WEIGHTS_NAME, recursive=True))
-            )
-            logging.getLogger("transformers.modeling_utils").setLevel(logging.WARN)  # Reduce logging
-        logger.info("Evaluate the following checkpoints: %s", checkpoints)
-        for checkpoint in checkpoints:
-            global_step = checkpoint.split("-")[-1] if len(checkpoints) > 1 else ""
-            prefix = checkpoint.split("/")[-1] if checkpoint.find("checkpoint") != -1 else ""
-
-            model = model_class.from_pretrained(checkpoint)
-            model.to(args.device)
-            result = evaluate(args, model, tokenizer, prefix=prefix, disable_logging=disable_logging)
-            result = dict((k + "_{}".format(global_step), v) for k, v in result.items())
-            results.update(result)
-
-    return results
-
-
-def get_args():
-    parser = argparse.ArgumentParser()
-
-    # Required parameters
-    parser.add_argument(
-        "--data_dir",
-        default=None,
-        type=str,
-        required=True,
-        help="The input data dir. Should contain the .tsv files (or other data files) for the task.",
-    )
-    parser.add_argument(
-        "--model_type",
-        default=None,
-        type=str,
-        required=True,
-        help="Model type selected in the list: " + ", ".join(MODEL_CLASSES.keys()),
-    )
-    parser.add_argument(
-        "--model_name_or_path",
-        default=None,
-        type=str,
-        required=True,
-        help="Path to pre-trained model or shortcut name selected in the list: " + ", ".join(ALL_MODELS),
-    )
-    parser.add_argument(
-        "--task_name",
-        default=None,
-        type=str,
-        required=True,
-        help="The name of the task to train selected in the list: " + ", ".join(processors.keys()),
-    )
-    parser.add_argument(
-        "--output_dir",
-        default=None,
-        type=str,
-        required=True,
-        help="The output directory where the model predictions and checkpoints will be written.",
-    )
-
-    # TPU Parameters
-    parser.add_argument("--num_cores", default=8, type=int, help="Number of TPU cores to use (1 or 8).")
-    parser.add_argument("--metrics_debug", action="store_true", help="Whether to print debug metrics.")
-
-    # Other parameters
-    parser.add_argument(
-        "--config_name", default="", type=str, help="Pretrained config name or path if not the same as model_name"
-    )
-    parser.add_argument(
-        "--tokenizer_name",
-        default="",
-        type=str,
-        help="Pretrained tokenizer name or path if not the same as model_name",
-    )
-    parser.add_argument(
-        "--cache_dir",
-        default="",
-        type=str,
-        help="Where do you want to store the pre-trained models downloaded and features file generated",
-    )
-    parser.add_argument(
-        "--max_seq_length",
-        default=128,
-        type=int,
-        help="The maximum total input sequence length after tokenization. Sequences longer "
-        "than this will be truncated, sequences shorter will be padded.",
-    )
-    parser.add_argument("--do_train", action="store_true", help="Whether to run training.")
-    parser.add_argument("--do_eval", action="store_true", help="Whether to run eval on the dev set.")
-    parser.add_argument(
-        "--evaluate_during_training", action="store_true", help="Rul evaluation during training at each logging step."
-    )
-    parser.add_argument(
-        "--do_lower_case", action="store_true", help="Set this flag if you are using an uncased model."
-    )
-
-    parser.add_argument("--train_batch_size", default=8, type=int, help="Per core batch size for training.")
-    parser.add_argument("--eval_batch_size", default=8, type=int, help="Per core batch size for evaluation.")
-    parser.add_argument(
-        "--gradient_accumulation_steps",
-        type=int,
-        default=1,
-        help="Number of updates steps to accumulate before performing a backward/update pass.",
-    )
-    parser.add_argument("--learning_rate", default=5e-5, type=float, help="The initial learning rate for Adam.")
-    parser.add_argument("--weight_decay", default=0.0, type=float, help="Weight deay if we apply some.")
-    parser.add_argument("--adam_epsilon", default=1e-8, type=float, help="Epsilon for Adam optimizer.")
-    parser.add_argument("--max_grad_norm", default=1.0, type=float, help="Max gradient norm.")
-    parser.add_argument(
-        "--num_train_epochs", default=3.0, type=float, help="Total number of training epochs to perform."
-    )
-    parser.add_argument(
-        "--max_steps",
-        default=-1,
-        type=int,
-        help="If > 0: set total number of training steps to perform. Override num_train_epochs.",
-    )
-    parser.add_argument("--warmup_steps", default=0, type=int, help="Linear warmup over warmup_steps.")
-    parser.add_argument("--tensorboard_logdir", default="./runs", type=str, help="Where to write tensorboard metrics.")
-    parser.add_argument("--logging_steps", type=int, default=50, help="Log every X update steps.")
-    parser.add_argument("--only_log_master", action="store_true", help="Whether to log only from each hosts master.")
-    parser.add_argument("--save_steps", type=int, default=50, help="Save checkpoint every X update steps.")
-    parser.add_argument(
-        "--eval_all_checkpoints",
-        action="store_true",
-        help="Evaluate all checkpoints starting with the same prefix as model_name ending and ending with step number",
-    )
-    parser.add_argument(
-        "--overwrite_output_dir", action="store_true", help="Overwrite the content of the output directory"
-    )
-    parser.add_argument(
-        "--overwrite_cache", action="store_true", help="Overwrite the cached training and evaluation sets"
-    )
-    parser.add_argument("--seed", type=int, default=42, help="random seed for initialization")
-
-    return parser.parse_args()
-
-
-def _mp_fn(rank, args):
-    main(args)
-
-
-def main_cli():
-    args = get_args()
-    xmp.spawn(_mp_fn, args=(args,), nprocs=args.num_cores)
-
-
-if __name__ == "__main__":
-    main_cli()
--- a/examples/summarization/bart/finetune.py
+++ b/examples/summarization/bart/finetune.py
@@ -7,7 +7,7 @@ import time
 import torch
 from torch.utils.data import DataLoader

-from transformer_base import BaseTransformer, add_generic_args, generic_train, get_linear_schedule_with_warmup
+from lightning_base import BaseTransformer, add_generic_args, generic_train, get_linear_schedule_with_warmup


 try:

--- a/examples/summarization/bart/run_train.sh
+++ b/examples/summarization/bart/run_train.sh
@@ -5,7 +5,7 @@ export OUTPUT_DIR=${CURRENT_DIR}/${OUTPUT_DIR_NAME}
 # Make output directory if it doesn't exist
 mkdir -p $OUTPUT_DIR

-# Add parent directory to python path to access transformer_base.py
+# Add parent directory to python path to access lightning_base.py
 export PYTHONPATH="../../":"${PYTHONPATH}"

 python finetune.py \

--- a/examples/summarization/bart/run_train_tiny.sh
+++ b/examples/summarization/bart/run_train_tiny.sh
@@ -12,7 +12,7 @@ export OUTPUT_DIR=${CURRENT_DIR}/${OUTPUT_DIR_NAME}
 # Make output directory if it doesn't exist
 mkdir -p $OUTPUT_DIR

-# Add parent directory to python path to access transformer_base.py and utils.py
+# Add parent directory to python path to access lightning_base.py and utils.py
 export PYTHONPATH="../../":"${PYTHONPATH}"
 python finetune.py \
 --data_dir=cnn_tiny/ \

--- a/examples/test_examples.py
+++ b/examples/test_examples.py
@@ -16,14 +16,24 @@

 import argparse
 import logging
+import os
 import sys
 import unittest
 from unittest.mock import patch

-import run_generation
-import run_glue
-import run_language_modeling
-import run_squad
+
+SRC_DIRS = [
+    os.path.join(os.path.dirname(__file__), dirname)
+    for dirname in ["text-generation", "text-classification", "language-modeling", "question-answering"]
+]
+sys.path.extend(SRC_DIRS)
+
+
+if SRC_DIRS is not None:
+    import run_generation
+    import run_glue
+    import run_language_modeling
+    import run_squad


 logging.basicConfig(level=logging.DEBUG)
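
The `sys.path.extend(SRC_DIRS)` step in this hunk is what lets the test module import scripts that now live in per-task subdirectories. A minimal, self-contained sketch of the same idea follows. The temporary directory and the module name `demo_script` are hypothetical stand-ins, not files from this repository.

```python
import importlib
import os
import sys
import tempfile

with tempfile.TemporaryDirectory() as root:
    # Hypothetical per-task directory holding one script (created only for this sketch).
    task_dir = os.path.join(root, "text-classification")
    os.makedirs(task_dir)
    with open(os.path.join(task_dir, "demo_script.py"), "w") as handle:
        handle.write("def main():\n    return 'ok'\n")

    # Make the directory importable, as test_examples.py does for SRC_DIRS.
    sys.path.extend([task_dir])
    importlib.invalidate_caches()
    try:
        demo_script = importlib.import_module("demo_script")
        result = demo_script.main()
    finally:
        # Undo the path change so the sketch leaves no side effects behind.
        sys.path.remove(task_dir)

print(result)
```

Appending to `sys.path` rather than prepending means an installed module with the same name would still take priority, which may or may not be what a test suite wants.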
@@ -43,24 +53,24 @@ class ExamplesTests(unittest.TestCase):
        stream_handler = logging.StreamHandler(sys.stdout)
        logger.addHandler(stream_handler)

-        testargs = [
-            "run_glue.py",
-            "--data_dir=./examples/tests_samples/MRPC/",
-            "--task_name=mrpc",
-            "--do_train",
-            "--do_eval",
-            "--output_dir=./examples/tests_samples/temp_dir",
-            "--per_gpu_train_batch_size=2",
-            "--per_gpu_eval_batch_size=1",
-            "--learning_rate=1e-4",
-            "--max_steps=10",
-            "--warmup_steps=2",
-            "--overwrite_output_dir",
-            "--seed=42",
-            "--max_seq_length=128",
-        ]
-        model_name = "--model_name_or_path=bert-base-uncased"
-        with patch.object(sys, "argv", testargs + [model_name]):
+        testargs = """
+            run_glue.py
+            --model_name_or_path bert-base-uncased
+            --data_dir ./tests/fixtures/tests_samples/MRPC/
+            --task_name mrpc
+            --do_train
+            --do_eval
+            --output_dir ./tests/fixtures/tests_samples/temp_dir
+            --per_gpu_train_batch_size=2
+            --per_gpu_eval_batch_size=1
+            --learning_rate=1e-4
+            --max_steps=10
+            --warmup_steps=2
+            --overwrite_output_dir
+            --seed=42
+            --max_seq_length=128
+            """.split()
+        with patch.object(sys, "argv", testargs):
            result = run_glue.main()
            del result["loss"]
            for value in result.values():
@@ -78,7 +88,7 @@ class ExamplesTests(unittest.TestCase):
            --line_by_line
            --train_data_file ./tests/fixtures/sample_text.txt
            --eval_data_file ./tests/fixtures/sample_text.txt
-            --output_dir ./tests/fixtures
+            --output_dir ./tests/fixtures/tests_samples/temp_dir
            --overwrite_output_dir
            --do_train
            --do_eval
@@ -93,24 +103,25 @@ class ExamplesTests(unittest.TestCase):
        stream_handler = logging.StreamHandler(sys.stdout)
        logger.addHandler(stream_handler)

-        testargs = [
-            "run_squad.py",
-            "--data_dir=./examples/tests_samples/SQUAD",
-            "--model_name=bert-base-uncased",
-            "--output_dir=./examples/tests_samples/temp_dir",
-            "--max_steps=10",
-            "--warmup_steps=2",
-            "--do_train",
-            "--do_eval",
-            "--version_2_with_negative",
-            "--learning_rate=2e-4",
-            "--per_gpu_train_batch_size=2",
-            "--per_gpu_eval_batch_size=1",
-            "--overwrite_output_dir",
-            "--seed=42",
-        ]
-        model_type, model_name = ("--model_type=bert", "--model_name_or_path=bert-base-uncased")
-        with patch.object(sys, "argv", testargs + [model_type, model_name]):
+        testargs = """
+            run_squad.py
+            --model_type=bert
+            --model_name_or_path=bert-base-uncased
+            --data_dir=./tests/fixtures/tests_samples/SQUAD
+            --model_name=bert-base-uncased
+            --output_dir=./tests/fixtures/tests_samples/temp_dir
+            --max_steps=10
+            --warmup_steps=2
+            --do_train
+            --do_eval
+            --version_2_with_negative
+            --learning_rate=2e-4
+            --per_gpu_train_batch_size=2
+            --per_gpu_eval_batch_size=1
+            --overwrite_output_dir
+            --seed=42
+        """.split()
+        with patch.object(sys, "argv", testargs):
            result = run_squad.main()
            self.assertGreaterEqual(result["f1"], 30)
            self.assertGreaterEqual(result["exact"], 30)
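
The rewritten tests build `sys.argv` by splitting a triple-quoted string on whitespace and patching it in with `patch.object`. The sketch below reproduces that pattern against a small, hypothetical parser (`parse_cli` and `run_demo.py` are illustrative names, not part of the examples).

```python
import argparse
import sys
from unittest.mock import patch


def parse_cli():
    # Hypothetical stand-in for an example script's argument parsing.
    parser = argparse.ArgumentParser()
    parser.add_argument("--task_name", type=str, required=True)
    parser.add_argument("--max_steps", type=int, default=-1)
    parser.add_argument("--do_train", action="store_true")
    return parser.parse_args()


# Whitespace splitting turns the block into one argv token per word,
# so both "--flag value" and "--flag=value" spellings work.
testargs = """
    run_demo.py
    --task_name mrpc
    --max_steps=10
    --do_train
    """.split()

# sys.argv is replaced only inside the with-block and restored afterwards.
with patch.object(sys, "argv", testargs):
    args = parse_cli()

print(args)
```

One caveat of this style: a value that itself contains spaces (for example a path with a space) would be split into several tokens, so it only suits arguments without embedded whitespace.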

--- a/examples/text-classification/README.md
+++ b/examples/text-classification/README.md
+## GLUE Benchmark
+
+# Run TensorFlow 2.0 version
+
+Based on the script [`run_tf_glue.py`](https://github.com/huggingface/transformers/blob/master/examples/run_tf_glue.py).
+
+Fine-tunes the library's TensorFlow 2.0 BERT model for sequence classification on the MRPC task of the GLUE benchmark: [General Language Understanding Evaluation](https://gluebenchmark.com/).
+
+This script has two options: mixed precision (Automatic Mixed Precision / AMP), which runs models on Tensor Cores (NVIDIA Volta/Turing GPUs) and future hardware, and XLA, which uses the XLA compiler to reduce model runtime.
+Toggle them with the `USE_XLA` and `USE_AMP` variables in the script.
+These options and the benchmark below were provided by @tlkh.
+
+Quick benchmarks from the script (no other modifications):
+
+| GPU    | Mode | Time (2nd epoch) | Val Acc (3 runs) |
+| --------- | -------- | ----------------------- | ----------------------|
+| Titan V | FP32 | 41s | 0.8438/0.8281/0.8333 |
+| Titan V | AMP | 26s | 0.8281/0.8568/0.8411 |
+| V100    | FP32 | 35s | 0.8646/0.8359/0.8464 |
+| V100    | AMP | 22s | 0.8646/0.8385/0.8411 |
+| 1080 Ti | FP32 | 55s | - |
+
+With the same hardware, hyper-parameters, and batch size, mixed precision (AMP) cuts the second-epoch training time by roughly 37% (41s to 26s on the Titan V, 35s to 22s on the V100).
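
As a quick check of the benchmark table, the AMP speedup can be computed directly from the second-epoch timings. This is only arithmetic on the numbers listed in the table. The 1080 Ti row is left out because it has no AMP entry.

```python
# Second-epoch timings in seconds, copied from the benchmark table.
timings = {
    "Titan V": {"FP32": 41, "AMP": 26},
    "V100": {"FP32": 35, "AMP": 22},
}

# Speedup factor of AMP over FP32 for each GPU.
speedups = {gpu: t["FP32"] / t["AMP"] for gpu, t in timings.items()}

for gpu, factor in speedups.items():
    print(f"{gpu}: {factor:.2f}x faster with AMP")
```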
+
+
+
+# Run PyTorch version
+
+Based on the script [`run_glue.py`](https://github.com/huggingface/transformers/blob/master/examples/text-classification/run_glue.py).
+
+Fine-tuning the library models for sequence classification on the GLUE benchmark: [General Language Understanding
+Evaluation](https://gluebenchmark.com/). This script can fine-tune the following models: BERT, XLM, XLNet and RoBERTa.
+
+GLUE is made up of a total of 9 different tasks. We get the following results on the dev set of the benchmark with an
+uncased BERT base model (the checkpoint `bert-base-uncased`). All experiments ran on single V100 GPUs with total train
+batch sizes between 16 and 64. Some of these tasks have a small dataset, so training can lead to high variance in the results
+between different runs. We report the median over 5 runs (with different seeds) for each of the metrics.
+
+| Task  | Metric                       | Result      |
+|-------|------------------------------|-------------|
+| CoLA  | Matthews corr.               | 49.23       |
+| SST-2 | Accuracy                     | 91.97       |
+| MRPC  | F1/Accuracy                  | 89.47/85.29 |
+| STS-B | Pearson/Spearman corr.       | 83.95/83.70 |
+| QQP   | Accuracy/F1                  | 88.40/84.31 |
+| MNLI  | Matched acc./Mismatched acc. | 80.61/81.08 |
+| QNLI  | Accuracy                     | 87.46       |
+| RTE   | Accuracy                     | 61.73       |
+| WNLI  | Accuracy                     | 45.07       |
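
Reporting the median over several seeds, as in the table, can be sketched as follows. The per-seed scores are hypothetical placeholders for illustration only and are not results from these experiments.

```python
import statistics

# Hypothetical dev-set accuracies from 5 runs with different seeds (illustrative values).
seed_scores = {42: 0.8529, 43: 0.8407, 44: 0.8603, 45: 0.8480, 46: 0.8554}

# The median is less sensitive than the mean to a single unlucky seed,
# which matters for the small GLUE tasks with high run-to-run variance.
median_score = statistics.median(seed_scores.values())
print(f"median over {len(seed_scores)} seeds: {median_score:.4f}")
```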
+
+Some of these results are significantly different from the ones reported on the test set
+of the GLUE benchmark on the website. For QQP and WNLI, please refer to [FAQ #12](https://gluebenchmark.com/faq) on the website.
+
+Before running any one of these GLUE tasks you should download the
+[GLUE data](https://gluebenchmark.com/tasks) by running
+[this script](https://gist.github.com/W4ngatang/60c2bdb54d156a41194446737ce03e2e)
+and unpack it to some directory `$GLUE_DIR`.
+
+```bash
+export GLUE_DIR=/path/to/glue
+export TASK_NAME=MRPC
+
+python run_glue.py \
+  --model_type bert \
+  --model_name_or_path bert-base-cased \
+  --task_name $TASK_NAME \
+  --do_train \
+  --do_eval \
+  --data_dir $GLUE_DIR/$TASK_NAME \
+  --max_seq_length 128 \
+  --per_gpu_train_batch_size 32 \
+  --learning_rate 2e-5 \
+  --num_train_epochs 3.0 \
+  --output_dir /tmp/$TASK_NAME/
+```
+
+where task name can be one of CoLA, SST-2, MRPC, STS-B, QQP, MNLI, QNLI, RTE, WNLI.
+
+The dev set results are written to the text file `eval_results.txt` in the specified `output_dir`.
+In the case of MNLI, since there are two separate dev sets (matched and mismatched), there will be a separate
+output folder called `/tmp/MNLI-MM/` in addition to `/tmp/MNLI/`.
+
+The code has not been tested with half-precision training with apex on any GLUE task apart from MRPC, MNLI,
+CoLA, SST-2. The following section provides details on how to run half-precision training with MRPC. With that being
+said, there shouldn’t be any issues in running half-precision training with the remaining GLUE tasks as well,
+since the data processor for each task inherits from the base class DataProcessor.
+
+## Running on TPUs
+
+You can accelerate your workloads on Google's TPUs. For information on how to set up your TPU environment, refer to this
+[README](https://github.com/pytorch/xla/blob/master/README.md).
+
+The following are some examples of running the `*_tpu.py` fine-tuning scripts on TPUs. All steps for data preparation are
+identical to your normal GPU + Hugging Face setup.
+
+To run a GLUE task on the MNLI dataset, you can use something like the following:
+
+```
+export XRT_TPU_CONFIG="tpu_worker;0;$TPU_IP_ADDRESS:8470"
+export GLUE_DIR=/path/to/glue
+export TASK_NAME=MNLI
+
+python run_glue_tpu.py \
+  --model_type bert \
+  --model_name_or_path bert-base-cased \
+  --task_name $TASK_NAME \
+  --do_train \
+  --do_eval \
+  --data_dir $GLUE_DIR/$TASK_NAME \
+  --max_seq_length 128 \
+  --train_batch_size 32 \
+  --learning_rate 3e-5 \
+  --num_train_epochs 3.0 \
+  --output_dir /tmp/$TASK_NAME \
+  --overwrite_output_dir \
+  --logging_steps 50 \
+  --save_steps 200 \
+  --num_cores=8 \
+  --only_log_master
+```
+
+### MRPC
+
+#### Fine-tuning example
+
+The following examples fine-tune BERT on the Microsoft Research Paraphrase Corpus (MRPC). Fine-tuning runs in less
+than 10 minutes on a single K-80 and in 27 seconds (!) on a single Tesla V100 16GB with apex installed.
+
+Before running any one of these GLUE tasks you should download the
+[GLUE data](https://gluebenchmark.com/tasks) by running
+[this script](https://gist.github.com/W4ngatang/60c2bdb54d156a41194446737ce03e2e)
+and unpack it to some directory `$GLUE_DIR`.
+
+```bash
+export GLUE_DIR=/path/to/glue
+
+python run_glue.py \
+  --model_name_or_path bert-base-cased \
+  --task_name MRPC \
+  --do_train \
+  --do_eval \
+  --data_dir $GLUE_DIR/MRPC/ \
+  --max_seq_length 128 \
+  --per_gpu_train_batch_size 32 \
+  --learning_rate 2e-5 \
+  --num_train_epochs 3.0 \
+  --output_dir /tmp/mrpc_output/
+```
+
+Our tests, run on a few seeds with [the original implementation
+hyper-parameters](https://github.com/google-research/bert#sentence-and-sentence-pair-classification-tasks), gave evaluation
+results between 84% and 88%.
+
+#### Using Apex and mixed-precision
+
+Using Apex and 16-bit precision, fine-tuning on MRPC takes only 27 seconds. First install
+[apex](https://github.com/NVIDIA/apex), then run the following example:
+
+```bash
+export GLUE_DIR=/path/to/glue
+
+python run_glue.py \
+  --model_name_or_path bert-base-cased \
+  --task_name MRPC \
+  --do_train \
+  --do_eval \
+  --data_dir $GLUE_DIR/MRPC/ \
+  --max_seq_length 128 \
+  --per_gpu_train_batch_size 32 \
+  --learning_rate 2e-5 \
+  --num_train_epochs 3.0 \
+  --output_dir /tmp/mrpc_output/ \
+  --fp16
+```
+
+#### Distributed training
+
+Here is an example using distributed training on 8 V100 GPUs. The model used is the BERT whole-word-masking model, and it
+reaches F1 > 92 on MRPC.
+
+```bash
+export GLUE_DIR=/path/to/glue
+
+python -m torch.distributed.launch \
+    --nproc_per_node 8 run_glue.py \
+    --model_name_or_path bert-base-cased \
+    --task_name MRPC \
+    --do_train \
+    --do_eval \
+    --data_dir $GLUE_DIR/MRPC/ \
+    --max_seq_length 128 \
+    --per_gpu_train_batch_size 8 \
+    --learning_rate 2e-5 \
+    --num_train_epochs 3.0 \
+    --output_dir /tmp/mrpc_output/
+```
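
The distributed command above splits each epoch across 8 processes with a per-GPU batch size of 8, so the effective batch size is 64. The sketch below estimates the resulting number of optimization steps. It rests on assumptions that are not stated in this README: an MRPC training set of 3,668 examples, each process receiving an equal (rounded-up) share of the data, and no gradient accumulation. The helper name `total_optimization_steps` is hypothetical.

```python
import math


def total_optimization_steps(num_examples, num_processes, per_gpu_batch_size, num_epochs):
    """Estimate optimizer steps for data-parallel training (no gradient accumulation)."""
    # Each process sees an equal share of the data, rounded up (assumption).
    samples_per_process = math.ceil(num_examples / num_processes)
    # The last, possibly smaller, batch of each epoch still counts as a step.
    steps_per_epoch = math.ceil(samples_per_process / per_gpu_batch_size)
    return steps_per_epoch * num_epochs


# 3,668 is the assumed size of the MRPC training set.
steps = total_optimization_steps(
    num_examples=3668, num_processes=8, per_gpu_batch_size=8, num_epochs=3
)
print(steps)
```

Under these assumptions the estimate comes to 174 steps, which agrees with the `global_step = 174` value in the results listed for this run.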
+
+Training with these hyper-parameters gave us the following results:
+
+```bash
+acc = 0.8823529411764706
+acc_and_f1 = 0.901702786377709
+eval_loss = 0.3418912578906332
+f1 = 0.9210526315789473
+global_step = 174
+loss = 0.07231863956341798
+```
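
Result listings like the one above are plain `key = value` lines, so they are easy to load for comparison across runs. The parser below is a minimal sketch that assumes this layout. `parse_eval_results` is a hypothetical helper, not a function provided by the library. The check at the end shows that the reported `acc_and_f1` is consistent with the mean of `acc` and `f1`.

```python
import math


def parse_eval_results(text):
    """Parse 'key = value' lines into a dict of floats, skipping any other lines."""
    results = {}
    for line in text.splitlines():
        if "=" not in line:
            continue  # e.g. blank lines or '***** Eval results *****' banners
        key, value = line.split("=", 1)
        results[key.strip()] = float(value)
    return results


# Values copied from the MRPC results listing.
sample = """\
acc = 0.8823529411764706
acc_and_f1 = 0.901702786377709
f1 = 0.9210526315789473
global_step = 174
"""

results = parse_eval_results(sample)
mean_of_acc_and_f1 = (results["acc"] + results["f1"]) / 2
print(math.isclose(results["acc_and_f1"], mean_of_acc_and_f1))
```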
+
+### MNLI
+
+The following example uses the BERT-large, uncased, whole-word-masking model and fine-tunes it on the MNLI task.
+
+```bash
+export GLUE_DIR=/path/to/glue
+
+python -m torch.distributed.launch \
+    --nproc_per_node 8 run_glue.py \
+    --model_name_or_path bert-base-cased \
+    --task_name mnli \
+    --do_train \
+    --do_eval \
+    --data_dir $GLUE_DIR/MNLI/ \
+    --max_seq_length 128 \
+    --per_gpu_train_batch_size 8 \
+    --learning_rate 2e-5 \
+    --num_train_epochs 3.0 \
+    --output_dir output_dir
+```
+
+The results are the following:
+
+```bash
+***** Eval results *****
+  acc = 0.8679706601466992
+  eval_loss = 0.4911287787382479
+  global_step = 18408
+  loss = 0.04755385363816904
+
+***** Eval results *****
+  acc = 0.8747965825874695
+  eval_loss = 0.45516540421714036
+  global_step = 18408
+  loss = 0.04755385363816904
+```
+
+# Run PyTorch version using PyTorch-Lightning
+
+Run `bash run_pl.sh` from the `glue` directory. This will also install `pytorch-lightning` and the requirements in `examples/requirements.txt`. It is a shell pipeline that automatically downloads and pre-processes the data, then runs the specified models. Logs are saved in the `lightning_logs` directory.
+
+Pass the `--n_gpu` flag to change the number of GPUs; the default is 1. At the end, the expected results are:
+
+```
+TEST RESULTS {'val_loss': tensor(0.0707), 'precision': 0.852427800698191, 'recall': 0.869537067011978, 'f1': 0.8608974358974358}
+```
+
+
+# XNLI
+
+Based on the script [`run_xnli.py`](https://github.com/huggingface/transformers/blob/master/examples/run_xnli.py).
+
+[XNLI](https://www.nyu.edu/projects/bowman/xnli/) is a crowd-sourced dataset based on [MultiNLI](http://www.nyu.edu/projects/bowman/multinli/). It is an evaluation benchmark for cross-lingual text representations. Pairs of text are labeled with textual entailment annotations for 15 different languages (including both high-resource languages such as English and low-resource languages such as Swahili).
+
+#### Fine-tuning on XNLI
+
+This example code fine-tunes mBERT (multi-lingual BERT) on the XNLI dataset. It runs in 106 minutes
+on a single Tesla V100 16GB. The data for XNLI can be downloaded from the following links; both archives should be saved (and unzipped) in a
+`$XNLI_DIR` directory.
+
+* [XNLI 1.0](https://www.nyu.edu/projects/bowman/xnli/XNLI-1.0.zip)
+* [XNLI-MT 1.0](https://www.nyu.edu/projects/bowman/xnli/XNLI-MT-1.0.zip)
+
+```bash
+export XNLI_DIR=/path/to/XNLI
+
+python run_xnli.py \
+  --model_type bert \
+  --model_name_or_path bert-base-multilingual-cased \
+  --language de \
+  --train_language en \
+  --do_train \
+  --do_eval \
+  --data_dir $XNLI_DIR \
+  --per_gpu_train_batch_size 32 \
+  --learning_rate 5e-5 \
+  --num_train_epochs 2.0 \
+  --max_seq_length 128 \
+  --output_dir /tmp/debug_xnli/ \
+  --save_steps -1
+```
+
+Training with the previously defined hyper-parameters yields the following results on the **test** set:
+
+```bash
+acc = 0.7093812375249501
+```
+
+
+
+
--- a/examples/run_glue.py
+++ b/examples/run_glue.py
--- a/examples/glue/run_pl.sh
+++ b/examples/glue/run_pl.sh
@@ -20,7 +20,7 @@ export OUTPUT_DIR=${CURRENT_DIR}/${OUTPUT_DIR_NAME}

 # Make output directory if it doesn't exist
 mkdir -p $OUTPUT_DIR
-# Add parent directory to python path to access transformer_base.py
+# Add parent directory to python path to access lightning_base.py
 export PYTHONPATH="../":"${PYTHONPATH}"

 python3 run_pl_glue.py --data_dir $DATA_DIR \

--- a/examples/glue/run_pl_glue.py
+++ b/examples/glue/run_pl_glue.py
@@ -8,7 +8,7 @@ import numpy as np
 import torch
 from torch.utils.data import DataLoader, TensorDataset

-from transformer_base import BaseTransformer, add_generic_args, generic_train
+from lightning_base import BaseTransformer, add_generic_args, generic_train
 from transformers import glue_compute_metrics as compute_metrics
 from transformers import glue_convert_examples_to_features as convert_examples_to_features
 from transformers import glue_output_modes

--- a/examples/run_tf_glue.py
+++ b/examples/run_tf_glue.py
--- a/examples/run_xnli.py
+++ b/examples/run_xnli.py
@@ -14,7 +14,7 @@
 # See the License for the specific language governing permissions and
 # limitations under the License.
 """ Finetuning multi-lingual models on XNLI (Bert, DistilBERT, XLM).
-    Adapted from `examples/run_glue.py`"""
+    Adapted from `examples/text-classification/run_glue.py`"""


 import argparse

--- a/examples/text-generation/README.md
+++ b/examples/text-generation/README.md
+## Language generation
+
+Based on the script [`run_generation.py`](https://github.com/huggingface/transformers/blob/master/examples/run_generation.py).
+
+Conditional text generation using the auto-regressive models of the library: GPT, GPT-2, Transformer-XL, XLNet, CTRL.
+A similar script is used for our official demo [Write With Transformer](https://transformer.huggingface.co), where you
+can try out the different models available in the library.
+
+Example usage:
+
+```bash
+python run_generation.py \
+    --model_type=gpt2 \
+    --model_name_or_path=gpt2
+```
--- a/examples/pplm/README.md
+++ b/examples/pplm/README.md
--- a/examples/pplm/imgs/headfigure.png
+++ b/examples/pplm/imgs/headfigure.png
--- a/examples/pplm/imgs/wooly.png
+++ b/examples/pplm/imgs/wooly.png