Internal change

PiperOrigin-RevId: 375548215

Commit 86ca3ebb, authored May 24, 2021 by Hongkun Yu; committed by A. Unique TensorFlower May 24, 2021
5 changed files
--- a/official/nlp/configs/encoders.py
+++ b/official/nlp/configs/encoders.py
@@ -26,6 +26,7 @@ from official.modeling import hyperparams
 from official.modeling import tf_utils
 from official.nlp.modeling import layers
 from official.nlp.modeling import networks
+from official.nlp.projects.bigbird import encoder as bigbird_encoder
 @dataclasses.dataclass
@@ -293,9 +294,26 @@ def build_encoder(config: EncoderConfig,
        dict_outputs=True)
  if encoder_type == "bigbird":
-    # TODO(frederickliu): Support use_gradient_checkpointing.
+    # TODO(frederickliu): Support use_gradient_checkpointing and update
+    # experiments to use the EncoderScaffold only.
    if encoder_cfg.use_gradient_checkpointing:
-      raise ValueError("Gradient checkpointing unsupported at the moment.")
+      return bigbird_encoder.BigBirdEncoder(
+          vocab_size=encoder_cfg.vocab_size,
+          hidden_size=encoder_cfg.hidden_size,
+          num_layers=encoder_cfg.num_layers,
+          num_attention_heads=encoder_cfg.num_attention_heads,
+          intermediate_size=encoder_cfg.intermediate_size,
+          activation=tf_utils.get_activation(encoder_cfg.hidden_activation),
+          dropout_rate=encoder_cfg.dropout_rate,
+          attention_dropout_rate=encoder_cfg.attention_dropout_rate,
+          num_rand_blocks=encoder_cfg.num_rand_blocks,
+          block_size=encoder_cfg.block_size,
+          max_position_embeddings=encoder_cfg.max_position_embeddings,
+          type_vocab_size=encoder_cfg.type_vocab_size,
+          initializer=tf.keras.initializers.TruncatedNormal(
+              stddev=encoder_cfg.initializer_range),
+          embedding_width=encoder_cfg.embedding_width,
+          use_gradient_checkpointing=encoder_cfg.use_gradient_checkpointing)
    embedding_cfg = dict(
        vocab_size=encoder_cfg.vocab_size,
        type_vocab_size=encoder_cfg.type_vocab_size,
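
Reviewer note: with this hunk, a BigBird config that sets `use_gradient_checkpointing` is routed to the project-level `BigBirdEncoder`, while other BigBird configs appear to continue into the `EncoderScaffold` setup that follows. The sketch below only illustrates that dispatch. `FakeBigBirdConfig` and `select_bigbird_path` are hypothetical stand-ins, and the function returns labels rather than real Keras models, so nothing here imports the Model Garden.

```python
import dataclasses


@dataclasses.dataclass
class FakeBigBirdConfig:
    """Hypothetical stand-in for the real BigBird encoder config."""
    use_gradient_checkpointing: bool = False


def select_bigbird_path(encoder_cfg: FakeBigBirdConfig) -> str:
    """Mirrors the branch added in this hunk, using labels instead of models."""
    if encoder_cfg.use_gradient_checkpointing:
        # The commit returns bigbird_encoder.BigBirdEncoder(...) at this point.
        return "BigBirdEncoder"
    # Assumption: everything else continues into the EncoderScaffold setup.
    return "EncoderScaffold"


print(select_bigbird_path(FakeBigBirdConfig(use_gradient_checkpointing=True)))
print(select_bigbird_path(FakeBigBirdConfig()))
```

Both example YAML files in this commit set `use_gradient_checkpointing: false`, so under this reading they would take the second path.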

--- a/official/nlp/projects/bigbird/README.md
+++ b/official/nlp/projects/bigbird/README.md
+# BigBird: Transformers for Longer Sequences
+[BigBird](https://arxiv.org/abs/2007.14062)
+is a sparse attention mechanism that reduces the quadratic dependency of full
+attention on sequence length to linear. BigBird is a universal approximator of
+sequence functions and is Turing complete, thereby preserving these properties
+of the quadratic, full attention model. Along the way, our theoretical analysis
+reveals some of the benefits of having O(1) global tokens (such as CLS) that
+attend to the entire sequence as part of the sparse attention mechanism.
+### Requirements
+The starter code requires TensorFlow. If you haven't installed it yet, follow
+the instructions on [tensorflow.org][1].
+This code has been tested with TensorFlow 2.5.0. Going forward,
+we will continue to target the latest released version of TensorFlow.
+Please verify that you have Python 3.6+ and TensorFlow 2.5.0 or higher
+installed by running the following commands:
+```sh
+python --version
+python -c 'import tensorflow as tf; print(tf.__version__)'
+```
+Refer to the [instructions here][2]
+for using the model in this repo. Make sure to add the models folder to your
+Python path.
+[1]: https://www.tensorflow.org/install/
+[2]:
+https://github.com/tensorflow/models/tree/master/official#running-the-models
+## Network Implementations
+We implement the encoder and layers using `tf.keras` APIs in the NLP
+modeling library:
+  * [bigbird_attention.py](https://github.com/tensorflow/models/blob/master/official/nlp/modeling/layers/bigbird_attention.py)
+  contains the BigBird sparse attention implementation.
+  * [encoders.py](https://github.com/tensorflow/models/blob/master/official/nlp/configs/encoders.py)
+  contains the integration of BigBird attention into the `EncoderScaffold`.
+  Note that gradient checkpointing is currently implemented in
+  [bigbird/encoder.py](https://github.com/tensorflow/models/blob/master/official/nlp/projects/bigbird/encoder.py).
+## Train using the config file
+Create a YAML file for specifying the parameters to be overridden.
+Working examples can be found in the `bigbird/experiments` directory.
+The code can be run in different modes: `train / train_and_eval / eval`.
+Run [`official/nlp/train.py`](https://github.com/tensorflow/models/blob/master/official/nlp/train.py)
+and specify which mode you wish to execute.
+### Data processing
+The script to process training data is the same as for BERT. Please check out
+the [instructions](https://github.com/tensorflow/models/blob/master/official/nlp/docs/train.md#fine-tuning-sentence-classification-with-bert-from-tf-hub).
+The SentencePiece vocabulary file can be downloaded [here](https://storage.googleapis.com/tf_model_garden/nlp/bigbird/vocab_sp.model).
+### GLUE
+The following commands will train and evaluate a model on GLUE datasets on TPUs.
+If you are using GPUs, just remove the `--tpu` flag and set
+`runtime.distribution_strategy` to `mirrored` to use the
+[`tf.distribute.MirroredStrategy`](https://www.tensorflow.org/api_docs/python/tf/distribute/MirroredStrategy).
+```bash
+INIT_CKPT=???
+TRAIN_FILE=???
+EVAL_FILE=???
+python3 official/nlp/train.py \
+   --experiment_type=bigbird/glue \
+   --config_file=official/nlp/projects/bigbird/experiments/glue_mnli_matched.yaml \
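+   --params_override=task.init_checkpoint=${INIT_CKPT} \
+   --params_override=task.train_data.input_path=${TRAIN_FILE},task.validation_data.input_path=${EVAL_FILE} \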
+   --params_override=runtime.distribution_strategy=tpu \
+   --tpu=??? \
+   --mode=train_and_eval
+```
+### SQuAD
+The following commands will train and evaluate a model on SQuAD datasets.
+```bash
+INIT_CKPT=???
+VOCAB_FILE=???
+TRAIN_FILE=???
+EVAL_FILE=???
+python3 official/nlp/train.py \
+   --experiment_type=bigbird/squad \
+   --config_file=official/nlp/projects/bigbird/experiments/squad_v1.yaml \
+   --params_override=task.init_checkpoint=${INIT_CKPT} \
+   --params_override=task.train_data.input_path=${TRAIN_FILE},task.validation_data.input_path=${EVAL_FILE},task.validation_data.vocab_file=${VOCAB_FILE} \
+   --params_override=runtime.distribution_strategy=tpu \
+   --tpu=??? \
+   --mode=train_and_eval
+```
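
Reviewer note: the README's opening claim (sparse attention reduces the quadratic dependency on sequence length to linear) can be made concrete with a rough count of attention scores. The sketch below is a simplified cost model, not the layer's implementation. It assumes each query block attends to a fixed number of key blocks (3 sliding-window blocks, 2 global blocks, plus `num_rand_blocks` random blocks) and ignores boundary blocks. The values `block_size=64` and `num_rand_blocks=3` are illustrative assumptions and are not taken from this commit.

```python
def full_attention_scores(seq_length: int) -> int:
    """Number of query-key scores computed by dense attention."""
    return seq_length * seq_length


def block_sparse_scores(seq_length: int,
                        block_size: int,
                        num_rand_blocks: int,
                        num_window_blocks: int = 3,
                        num_global_blocks: int = 2) -> int:
    """Rough score count when every query block sees a fixed set of key blocks.

    Assumption: boundary effects are ignored, so this is an estimate only.
    """
    num_blocks = seq_length // block_size
    attended = min(num_blocks,
                   num_window_blocks + num_global_blocks + num_rand_blocks)
    return num_blocks * attended * block_size * block_size


for n in (1024, 2048, 4096):
    dense = full_attention_scores(n)
    sparse = block_sparse_scores(n, block_size=64, num_rand_blocks=3)
    print(n, dense, sparse)
```

Under these assumptions, doubling the sequence length doubles the sparse count but quadruples the dense count, which is the linear versus quadratic behavior the README describes.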
--- a/official/nlp/projects/bigbird/experiment_configs.py
+++ b/official/nlp/projects/bigbird/experiment_configs.py
+# Copyright 2021 The TensorFlow Authors. All Rights Reserved.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+#     http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+"""Bigbird experiment configurations."""
+# pylint: disable=g-doc-return-or-yield,line-too-long
+from official.core import config_definitions as cfg
+from official.core import exp_factory
+from official.modeling import optimization
+from official.nlp.data import question_answering_dataloader
+from official.nlp.data import sentence_prediction_dataloader
+from official.nlp.tasks import question_answering
+from official.nlp.tasks import sentence_prediction
+@exp_factory.register_config_factory('bigbird/glue')
+def bigbird_glue() -> cfg.ExperimentConfig:
+  r"""BigBird GLUE."""
+  config = cfg.ExperimentConfig(
+      task=sentence_prediction.SentencePredictionConfig(
+          train_data=sentence_prediction_dataloader
+          .SentencePredictionDataConfig(),
+          validation_data=sentence_prediction_dataloader
+          .SentencePredictionDataConfig(
+              is_training=False, drop_remainder=False)),
+      trainer=cfg.TrainerConfig(
+          optimizer_config=optimization.OptimizationConfig({
+              'optimizer': {
+                  'type': 'adamw',
+                  'adamw': {
+                      'weight_decay_rate':
+                          0.01,
+                      'exclude_from_weight_decay':
+                          ['LayerNorm', 'layer_norm', 'bias'],
+                  }
+              },
+              'learning_rate': {
+                  'type': 'polynomial',
+                  'polynomial': {
+                      'initial_learning_rate': 3e-5,
+                      'end_learning_rate': 0.0,
+                  }
+              },
+              'warmup': {
+                  'type': 'polynomial'
+              }
+          })),
+      restrictions=[
+          'task.train_data.is_training != None',
+          'task.validation_data.is_training != None'
+      ])
+  config.task.model.encoder.type = 'bigbird'
+  return config
+@exp_factory.register_config_factory('bigbird/squad')
+def bigbird_squad() -> cfg.ExperimentConfig:
+  r"""BigBird Squad V1/V2."""
+  config = cfg.ExperimentConfig(
+      task=question_answering.QuestionAnsweringConfig(
+          train_data=question_answering_dataloader.QADataConfig(),
+          validation_data=question_answering_dataloader.QADataConfig()),
+      trainer=cfg.TrainerConfig(
+          optimizer_config=optimization.OptimizationConfig({
+              'optimizer': {
+                  'type': 'adamw',
+                  'adamw': {
+                      'weight_decay_rate':
+                          0.01,
+                      'exclude_from_weight_decay':
+                          ['LayerNorm', 'layer_norm', 'bias'],
+                  }
+              },
+              'learning_rate': {
+                  'type': 'polynomial',
+                  'polynomial': {
+                      'initial_learning_rate': 8e-5,
+                      'end_learning_rate': 0.0,
+                  }
+              },
+              'warmup': {
+                  'type': 'polynomial'
+              }
+          })),
+      restrictions=[
+          'task.train_data.is_training != None',
+          'task.validation_data.is_training != None'
+      ])
+  config.task.model.encoder.type = 'bigbird'
+  return config
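
Reviewer note: the two functions above are registered under the names `bigbird/glue` and `bigbird/squad`, which are the values the README passes to `--experiment_type`. The toy registry below sketches how a decorator-based lookup of this kind can work. It is a hypothetical illustration with plain dicts, not the `exp_factory` implementation, and the helper names `register_config_factory` and `get_exp_config` are redefined locally for the example.

```python
from typing import Callable, Dict

# Toy registry mapping experiment names to zero-argument config factories.
_REGISTRY: Dict[str, Callable[[], dict]] = {}


def register_config_factory(name: str):
    """Returns a decorator that stores the factory under `name`."""
    def decorator(factory: Callable[[], dict]) -> Callable[[], dict]:
        if name in _REGISTRY:
            raise KeyError(f"Experiment {name!r} is already registered.")
        _REGISTRY[name] = factory
        return factory
    return decorator


def get_exp_config(name: str) -> dict:
    """Looks up the factory by name and builds a fresh config."""
    return _REGISTRY[name]()


@register_config_factory("bigbird/glue")
def bigbird_glue() -> dict:
    return {"encoder_type": "bigbird", "initial_learning_rate": 3e-5}


@register_config_factory("bigbird/squad")
def bigbird_squad() -> dict:
    return {"encoder_type": "bigbird", "initial_learning_rate": 8e-5}


print(get_exp_config("bigbird/glue"))
print(get_exp_config("bigbird/squad"))
```

The only difference carried over from the commit is the initial learning rate (3e-5 for GLUE, 8e-5 for SQuAD). The real configs hold full task and trainer definitions.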
--- a/official/nlp/projects/bigbird/experiments/glue_mnli_matched.yaml
+++ b/official/nlp/projects/bigbird/experiments/glue_mnli_matched.yaml
+task:
+  hub_module_url: ''
+  model:
+    num_classes: 3
+    encoder:
+      type: bigbird
+      bigbird:
+        use_gradient_checkpointing: false
+        # hidden_size: 768
+        # num_layers: 12
+        # num_attention_heads: 12
+        # intermediate_size: 3072
+  init_checkpoint: 'TODO'
+  metric_type: 'accuracy'
+  train_data:
+    drop_remainder: true
+    global_batch_size: 32
+    input_path: 'TODO'
+    is_training: true
+    seq_length: 1024
+    label_type: 'int'
+  validation_data:
+    drop_remainder: false
+    global_batch_size: 32
+    input_path: 'TODO'
+    is_training: false
+    seq_length: 1024
+    label_type: 'int'
+trainer:
+  checkpoint_interval: 3000
+  optimizer_config:
+    learning_rate:
+      polynomial:
+        # 100% of train_steps.
+        decay_steps: 36813
+        end_learning_rate: 0.0
+        initial_learning_rate: 3.0e-05
+        power: 1.0
+      type: polynomial
+    optimizer:
+      type: adamw
+    warmup:
+      polynomial:
+        power: 1
+        # ~10% of train_steps.
+        warmup_steps: 3681
+      type: polynomial
+  steps_per_loop: 1000
+  summary_interval: 1000
+  # Training data size 392,702 examples, 3 epochs.
+  train_steps: 36813
+  validation_interval: 6135
+  # Eval data size = 9815 examples.
+  validation_steps: 307
+  best_checkpoint_export_subdir: 'best_ckpt'
+  best_checkpoint_eval_metric: 'cls_accuracy'
+  best_checkpoint_metric_comp: 'higher'
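
Reviewer note: the step counts in this YAML follow from the dataset sizes quoted in its comments. The arithmetic below reproduces them. Reading `validation_interval` as half an epoch is an inference from the numbers, not something the file states.

```python
import math

train_examples = 392_702   # from the YAML comment
eval_examples = 9_815      # from the YAML comment
global_batch_size = 32
epochs = 3

# drop_remainder: true on the training set, so partial batches are discarded.
steps_per_epoch = train_examples // global_batch_size
train_steps = steps_per_epoch * epochs

# "~10% of train_steps", rounded down.
warmup_steps = train_steps // 10

# drop_remainder: false on the validation set, so the last partial batch counts.
validation_steps = math.ceil(eval_examples / global_batch_size)

# Assumption: evaluation runs roughly twice per epoch.
validation_interval = steps_per_epoch // 2

print(train_steps, warmup_steps, validation_steps, validation_interval)
```

These values match `train_steps: 36813`, `warmup_steps: 3681`, `validation_steps: 307`, and `validation_interval: 6135` in the config.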
--- a/official/nlp/projects/bigbird/experiments/squad_v1.yaml
+++ b/official/nlp/projects/bigbird/experiments/squad_v1.yaml
+task:
+  hub_module_url: ''
+  model:
+    encoder:
+      type: bigbird
+      bigbird:
+        use_gradient_checkpointing: false
+        # hidden_size: 768
+        # num_layers: 12
+        # num_attention_heads: 12
+        # intermediate_size: 3072
+  max_answer_length: 30
+  n_best_size: 20
+  null_score_diff_threshold: 0.0
+  init_checkpoint: 'TODO'
+  train_data:
+    drop_remainder: true
+    global_batch_size: 48
+    input_path: 'TODO'
+    is_training: true
+    seq_length: 1024
+  validation_data:
+    do_lower_case: true
+    doc_stride: 128
+    drop_remainder: false
+    global_batch_size: 48
+    input_path: 'TODO'
+    is_training: false
+    query_length: 64
+    seq_length: 1024
+    tokenization: SentencePiece
+    version_2_with_negative: false
+    vocab_file: 'TODO'
+trainer:
+  checkpoint_interval: 1000
+  max_to_keep: 5
+  optimizer_config:
+    learning_rate:
+      polynomial:
+        decay_steps: 3699
+        end_learning_rate: 0.0
+        initial_learning_rate: 8.0e-05
+        power: 1.0
+      type: polynomial
+    optimizer:
+      type: adamw
+    warmup:
+      polynomial:
+        power: 1
+        warmup_steps: 370
+      type: polynomial
+  steps_per_loop: 1000
+  summary_interval: 1000
+  train_steps: 3699
+  validation_interval: 1000
+  validation_steps: 226
+  best_checkpoint_export_subdir: 'best_ckpt'
+  best_checkpoint_eval_metric: 'final_f1'
+  best_checkpoint_metric_comp: 'higher'
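
Reviewer note: the `learning_rate` and `warmup` sections of this YAML describe a polynomial warmup followed by a polynomial decay, both with power 1 (that is, linear). The sketch below evaluates such a schedule in plain Python with the SQuAD values. The decay formula is the standard polynomial decay. How the warmup composes with it (ramping toward the decayed value at `warmup_steps`) is an assumption and may differ from the library's behavior.

```python
def polynomial_decay(step: int, initial_lr: float, end_lr: float,
                     decay_steps: int, power: float = 1.0) -> float:
    """Standard polynomial decay, clamped once `decay_steps` is reached."""
    step = min(step, decay_steps)
    return (initial_lr - end_lr) * (1.0 - step / decay_steps) ** power + end_lr


def learning_rate(step: int,
                  initial_lr: float = 8e-5,
                  end_lr: float = 0.0,
                  decay_steps: int = 3699,
                  warmup_steps: int = 370,
                  warmup_power: float = 1.0) -> float:
    """Warmup then decay; the warmup target is an assumption (see lead-in)."""
    if step < warmup_steps:
        target = polynomial_decay(warmup_steps, initial_lr, end_lr, decay_steps)
        return target * (step / warmup_steps) ** warmup_power
    return polynomial_decay(step, initial_lr, end_lr, decay_steps)


for s in (0, 185, 370, 2000, 3699):
    print(s, learning_rate(s))
```

With these settings the rate starts at zero, peaks at step 370 slightly below 8e-5, and returns to zero at step 3699, which equals `train_steps`.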