Commit 9114f2a3 authored by A. Unique TensorFlower, committed by saberkun

Internal change

PiperOrigin-RevId: 404080616
parent ec0d7d0b
# MobileBERT-EdgeTPU
<figure align="center">
<img width=70% src=https://storage.googleapis.com/tf_model_garden/models/edgetpu/images/readme-mobilebert.png>
<figcaption>Performance of MobileBERT-EdgeTPU models on the SQuAD v1.1 dataset.</figcaption>
</figure>
Note: For the baseline MobileBERT float model, NNAPI delegates part of the
compute ops to the CPU, which makes the latency much higher.

Note: The accuracy numbers for BERT_base and BERT_large are from the original
[training results](https://arxiv.org/abs/1810.04805). These models are too
large to run on device.

Deploying low-latency, high-quality transformer-based language models on device
is highly desirable and can benefit multiple applications such as automatic
speech recognition (ASR), translation, sentence autocompletion, and even some
vision tasks. By co-designing the neural networks with the Edge TPU hardware
accelerator in the Google Tensor SoC, we have built EdgeTPU-customized
MobileBERT models that achieve datacenter-level model quality while
outperforming the baseline MobileBERT in latency.

We set up our model architecture search space based on
[MobileBERT](https://arxiv.org/abs/2004.02984) and leverage AutoML algorithms to
find models with up to 2x better hardware utilization. The higher utilization
lets us bring larger and more accurate models on chip while still outperforming
the baseline MobileBERT in latency. We built a customized distillation training
pipeline and performed an exhaustive hyperparameter search (e.g., learning rate,
dropout ratio) to achieve the best accuracy. As shown in the figure above, the
quantized MobileBERT-EdgeTPU models establish a new Pareto frontier for question
answering and also exceed the accuracy of the float BERT_base model, which is
400+ MB and too large to run on edge devices.

We also observed that, unlike most vision models, the accuracy of
MobileBERT/MobileBERT-EdgeTPU drops significantly with plain post-training
quantization (PTQ) or quantization-aware training (QAT). Proper model
modifications, such as clipping the attention mask value, are necessary to
retain the accuracy of a quantized model. Therefore, as an alternative to the
quantized models, we also provide a set of Edge TPU friendly float models that
produce a marginally better roofline than the baseline quantized MobileBERT.
Notably, the float MobileBERT-EdgeTPU-M model yields accuracy close to that of
BERT_large, which is 1.3 GB in float precision. Quantization thus becomes an
optional optimization rather than a prerequisite, which can unblock use cases
where quantization is infeasible or introduces large accuracy degradation, and
can potentially reduce time-to-market.
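
The mask-value clipping mentioned above is handled by the `EdgeTPUSoftmax`
layer added in this change (see `edgetpu_layers.py`): instead of the
conventional -10000 additive mask, masked logits are offset by a bounded -120,
which keeps the softmax input range friendly to 8-bit quantization. Below is a
minimal sketch of the layer's behavior; the shapes and values are purely
illustrative.

```python
import tensorflow as tf

from official.projects.edgetpu.nlp.modeling import edgetpu_layers

# Attention scores: (batch, num_heads, query_len, key_len).
scores = tf.random.normal([1, 4, 8, 8])
# Keep the first 6 key positions and mask the last 2 (1 = keep, 0 = mask).
mask = tf.reshape(tf.constant([1.0] * 6 + [0.0] * 2), [1, 1, 1, 8])

# Quantization-friendly softmax: masked logits are shifted by -120, not -10000.
softmax = edgetpu_layers.EdgeTPUSoftmax(mask_value=-120)
probs = softmax(scores, mask=mask)
print(probs.shape)  # (1, 4, 8, 8); probabilities at masked positions are ~0.
```
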
## Pre-trained Models
Model name | # Parameters | # Ops | MLM | Checkpoint | TFhub link
--------------------- | :----------: | :----: | :---: | :---: | :--------:
MobileBERT-EdgeTPU-M | 50.9M | 18.8e9 | 73.8% | WIP | WIP
MobileBERT-EdgeTPU-S | 38.3M | 14.0e9 | 72.8% | WIP | WIP
MobileBERT-EdgeTPU-XS | 27.1M | 9.4e9 | 71.2% | WIP | WIP
### Restoring from Checkpoints
To load the pre-trained MobileBERT checkpoint in your code, please follow the
example below or check the `serving/export_tflite_squad` module:
```python
import tensorflow as tf

from official.projects.edgetpu.nlp.configs import params
from official.projects.edgetpu.nlp.modeling import model_builder

bert_config_file = ...
model_checkpoint_path = ...
# Set up experiment params and load the configs from file/files.
experiment_params = params.EdgeTPUBERTCustomParams()
# change the input mask type to tf.float32 to avoid additional casting op.
experiment_params.student_model.encoder.mobilebert.input_mask_dtype = 'float32'
pretrainer_model = model_builder.build_bert_pretrainer(
experiment_params.student_model,
name='pretrainer',
quantization_friendly=True)
checkpoint_dict = {'model': pretrainer_model}
checkpoint = tf.train.Checkpoint(**checkpoint_dict)
checkpoint.restore(model_checkpoint_path).assert_existing_objects_matched()
```
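
The experiment parameters can also be loaded from one of the YAML files under
the project's `experiments/` directory via `EdgeTPUBERTCustomParams.from_yaml`,
as the unit tests do. A minimal sketch, where the local config path is
illustrative:

```python
from official.projects.edgetpu.nlp.configs import params

# Illustrative path to one of the experiment configs shipped with this project.
config_path = 'official/projects/edgetpu/nlp/experiments/mobilebert_edgetpu_m.yaml'
experiment_params = params.EdgeTPUBERTCustomParams.from_yaml(config_path)
print(experiment_params.student_model.encoder.mobilebert.num_blocks)  # 12 for the M variant.
```
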
### Use TF-Hub models
TODO(longy): Update with instructions to use tf-hub models
# Copyright 2021 The TensorFlow Authors. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
"""Datastructures for all the configurations for MobileBERT-EdgeTPU training."""
import dataclasses
from typing import Optional
from official.modeling import optimization
from official.modeling.hyperparams import base_config
from official.nlp.configs import bert
from official.nlp.data import pretrain_dataloader
DatasetParams = pretrain_dataloader.BertPretrainDataConfig
PretrainerModelParams = bert.PretrainerConfig
@dataclasses.dataclass
class OrbitParams(base_config.Config):
"""Parameters that setup Orbit training/evaluation pipeline.
Attributes:
mode: Orbit controller mode, can be 'train', 'train_and_evaluate', or
'evaluate'.
steps_per_loop: The number of steps to run in each inner loop of training.
total_steps: The global step count to train up to.
eval_steps: The number of steps to run during an evaluation. If -1, this
method will evaluate over the entire evaluation dataset.
eval_interval: The number of training steps to run between evaluations. If
set, training will always stop every `eval_interval` steps, even if this
results in a shorter inner loop than specified by `steps_per_loop`
setting. If None, evaluation will only be performed after training is
complete.
"""
mode: str = 'train'
steps_per_loop: int = 1000
total_steps: int = 1000000
eval_steps: int = -1
eval_interval: Optional[int] = None
@dataclasses.dataclass
class OptimizerParams(optimization.OptimizationConfig):
"""Optimizer parameters for MobileBERT-EdgeTPU."""
optimizer: optimization.OptimizerConfig = optimization.OptimizerConfig(
type='adamw',
adamw=optimization.AdamWeightDecayConfig(
weight_decay_rate=0.01,
exclude_from_weight_decay=['LayerNorm', 'layer_norm', 'bias']))
learning_rate: optimization.LrConfig = optimization.LrConfig(
type='polynomial',
polynomial=optimization.PolynomialLrConfig(
initial_learning_rate=1e-4,
decay_steps=1000000,
end_learning_rate=0.0))
warmup: optimization.WarmupConfig = optimization.WarmupConfig(
type='polynomial',
polynomial=optimization.PolynomialWarmupConfig(warmup_steps=10000))
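# Illustrative sketch (not part of this module): since OptimizerParams is an
# OptimizationConfig, it can be consumed by the standard optimizer factory in
# `official.modeling.optimization`. The exact call sites in the trainer may
# differ.
#
#   opt_config = OptimizerParams()
#   opt_factory = optimization.OptimizerFactory(opt_config)
#   learning_rate = opt_factory.build_learning_rate()
#   optimizer = opt_factory.build_optimizer(learning_rate)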
@dataclasses.dataclass
class RuntimeParams(base_config.Config):
"""Parameters that set up the training runtime.
TODO(longy): Can reuse the Runtime Config in:
official/core/config_definitions.py
  Attributes:
    distribution_strategy: Keras distribution strategy.
    use_gpu: Whether to use GPU.
    use_tpu: Whether to use TPU.
    num_gpus: Number of GPUs to use for training.
    all_reduce_alg: The all-reduce algorithm for multi-GPU training.
    num_workers: Number of parallel workers.
    tpu_address: The bns address of the TPU to use.
  """
distribution_strategy: str = 'off'
num_gpus: Optional[int] = 0
all_reduce_alg: Optional[str] = None
num_workers: int = 1
tpu_address: str = ''
use_gpu: Optional[bool] = None
use_tpu: Optional[bool] = None
@dataclasses.dataclass
class LayerWiseDistillationParams(base_config.Config):
"""Define the behavior of layer-wise distillation.
  Layer-wise distillation is an optional step in which knowledge is transferred
  layer by layer across all the transformer layers. The end-to-end distillation
  is performed after layer-wise distillation if the number of layer-wise
  distillation steps is not zero.
"""
num_steps: int = 10000
warmup_steps: int = 10000
initial_learning_rate: float = 1.5e-3
end_learning_rate: float = 1.5e-3
decay_steps: int = 10000
hidden_distill_factor: float = 100.0
beta_distill_factor: float = 5000.0
gamma_distill_factor: float = 5.0
attention_distill_factor: float = 1.0
@dataclasses.dataclass
class EndToEndDistillationParams(base_config.Config):
"""Define the behavior of end2end pretrainer distillation."""
num_steps: int = 580000
warmup_steps: int = 20000
initial_learning_rate: float = 1.5e-3
end_learning_rate: float = 1.5e-7
decay_steps: int = 580000
distill_ground_truth_ratio: float = 0.5
@dataclasses.dataclass
class EdgeTPUBERTCustomParams(base_config.Config):
"""EdgeTPU-BERT custom params.
Attributes:
train_dataset: An instance of the DatasetParams.
eval_dataset: An instance of the DatasetParams.
teacher_model: An instance of the PretrainerModelParams. If None, then the
student model is trained independently without distillation.
    student_model: An instance of the PretrainerModelParams.
teacher_model_init_checkpoint: Path for the teacher model init checkpoint.
student_model_init_checkpoint: Path for the student model init checkpoint.
layer_wise_distillation: Distillation config for the layer-wise step.
end_to_end_distillation: Distillation config for the end2end step.
optimizer: An instance of the OptimizerParams.
runtime: An instance of the RuntimeParams.
    orbit_config: An instance of the OrbitParams.
"""
train_datasest: DatasetParams = DatasetParams()
eval_dataset: DatasetParams = DatasetParams()
teacher_model: Optional[PretrainerModelParams] = PretrainerModelParams()
student_model: PretrainerModelParams = PretrainerModelParams()
teacher_model_init_checkpoint: str = ''
student_model_init_checkpoint: str = ''
layer_wise_distillation: LayerWiseDistillationParams = (
LayerWiseDistillationParams())
end_to_end_distillation: EndToEndDistillationParams = (
EndToEndDistillationParams())
optimizer: OptimizerParams = OptimizerParams()
runtime: RuntimeParams = RuntimeParams()
orbit_config: OrbitParams = OrbitParams()
task:
# hub_module_url: 'gs://**/panzf/mobilebert/tfhub/'
init_checkpoint: 'gs://**/edgetpu_bert/edgetpu_bert_float_candidate_13_e2e_820k/exported_ckpt/'
model:
num_classes: 3
metric_type: 'accuracy'
train_data:
drop_remainder: true
global_batch_size: 32
input_path: gs://**/yo/bert/glue/tfrecords/MNLI/MNLI_matched_train.tf_record
is_training: true
seq_length: 128
label_type: 'int'
validation_data:
drop_remainder: false
global_batch_size: 32
input_path: gs://**/yo/bert/glue/tfrecords/MNLI/MNLI_matched_eval.tf_record
is_training: false
seq_length: 128
label_type: 'int'
trainer:
checkpoint_interval: 10000
optimizer_config:
learning_rate:
polynomial:
# 100% of train_steps.
decay_steps: 50000
end_learning_rate: 0.0
initial_learning_rate: 3.0e-05
power: 1.0
type: polynomial
optimizer:
type: adamw
warmup:
polynomial:
power: 1
# ~10% of train_steps.
warmup_steps: 5000
type: polynomial
steps_per_loop: 1000
summary_interval: 1000
# Training data size 392,702 examples, 8 epochs.
train_steps: 50000
validation_interval: 2000
# Eval data size = 9815 examples.
validation_steps: 307
best_checkpoint_export_subdir: 'best_ckpt'
best_checkpoint_eval_metric: 'cls_accuracy'
best_checkpoint_metric_comp: 'higher'
# MobileBERT model from https://arxiv.org/abs/2004.02984.
task:
model:
encoder:
type: mobilebert
mobilebert:
word_vocab_size: 30522
word_embed_size: 128
type_vocab_size: 2
max_sequence_length: 512
num_blocks: 24
hidden_size: 512
num_attention_heads: 4
intermediate_size: 512
hidden_activation: relu
hidden_dropout_prob: 0.0
attention_probs_dropout_prob: 0.1
intra_bottleneck_size: 128
initializer_range: 0.02
key_query_shared_bottleneck: true
num_feedforward_networks: 4
normalization_type: no_norm
classifier_activation: false
# MobileBERT-EdgeTPU model.
task:
model:
encoder:
type: mobilebert
mobilebert:
word_vocab_size: 30522
word_embed_size: 128
type_vocab_size: 2
max_sequence_length: 512
num_blocks: 12
hidden_size: 512
num_attention_heads: 4
intermediate_size: 1024
hidden_activation: relu
hidden_dropout_prob: 0.1
attention_probs_dropout_prob: 0.1
intra_bottleneck_size: 256
initializer_range: 0.02
key_query_shared_bottleneck: true
num_feedforward_networks: 6
normalization_type: no_norm
classifier_activation: false
# MobileBERT-EdgeTPU-S model.
task:
model:
encoder:
type: mobilebert
mobilebert:
word_vocab_size: 30522
word_embed_size: 128
type_vocab_size: 2
max_sequence_length: 512
num_blocks: 12
hidden_size: 512
num_attention_heads: 4
intermediate_size: 1024
hidden_activation: relu
hidden_dropout_prob: 0.1
attention_probs_dropout_prob: 0.1
intra_bottleneck_size: 256
initializer_range: 0.02
key_query_shared_bottleneck: true
num_feedforward_networks: 4
normalization_type: no_norm
classifier_activation: false
# MobileBERT-EdgeTPU-XS model.
task:
model:
encoder:
type: mobilebert
mobilebert:
word_vocab_size: 30522
word_embed_size: 128
type_vocab_size: 2
max_sequence_length: 512
num_blocks: 8
hidden_size: 512
num_attention_heads: 4
intermediate_size: 1024
hidden_activation: relu
hidden_dropout_prob: 0.1
attention_probs_dropout_prob: 0.1
intra_bottleneck_size: 256
initializer_range: 0.02
key_query_shared_bottleneck: true
num_feedforward_networks: 4
normalization_type: no_norm
classifier_activation: false
task:
# hub_module_url: 'gs://**/panzf/mobilebert/tfhub/'
max_answer_length: 30
n_best_size: 20
null_score_diff_threshold: 0.0
init_checkpoint: 'gs://**/edgetpu_bert/edgetpu_bert_float_candidate_13_e2e_820k/exported_ckpt/'
train_data:
drop_remainder: true
global_batch_size: 32
input_path: gs://**/tp/bert/squad_v1.1/train.tf_record
is_training: true
seq_length: 384
validation_data:
do_lower_case: true
doc_stride: 128
drop_remainder: false
global_batch_size: 48
input_path: gs://**/squad/dev-v1.1.json
is_training: false
query_length: 64
seq_length: 384
tokenization: WordPiece
version_2_with_negative: false
vocab_file: gs://**/panzf/ttl-30d/mobilebert/tf2_checkpoint/vocab.txt
trainer:
checkpoint_interval: 1000
max_to_keep: 5
optimizer_config:
learning_rate:
polynomial:
decay_steps: 19420
end_learning_rate: 0.0
initial_learning_rate: 8.0e-05
power: 1.0
type: polynomial
optimizer:
type: adamw
warmup:
polynomial:
power: 1
# 10% of total training steps
warmup_steps: 1942
type: polynomial
steps_per_loop: 1000
summary_interval: 1000
# 7 epochs for training
train_steps: 19420
validation_interval: 3000
validation_steps: 226
best_checkpoint_export_subdir: 'best_ckpt'
best_checkpoint_eval_metric: 'final_f1'
best_checkpoint_metric_comp: 'higher'
# Distillation pretraining for Mobilebert.
# The final MLM accuracy is around 70.8% for e2e only training and 71.4% for layer-wise + e2e.
layer_wise_distillation:
num_steps: 10000
warmup_steps: 0
initial_learning_rate: 1.5e-3
end_learning_rate: 1.5e-3
decay_steps: 10000
end_to_end_distillation:
num_steps: 585000
warmup_steps: 20000
initial_learning_rate: 1.5e-3
end_learning_rate: 1.5e-7
decay_steps: 585000
distill_ground_truth_ratio: 0.5
optimizer:
optimizer:
lamb:
beta_1: 0.9
beta_2: 0.999
clipnorm: 1.0
epsilon: 1.0e-06
exclude_from_layer_adaptation: null
exclude_from_weight_decay: ['LayerNorm', 'bias', 'norm']
global_clipnorm: null
name: LAMB
weight_decay_rate: 0.01
type: lamb
orbit_config:
eval_interval: 1000
eval_steps: -1
mode: train
steps_per_loop: 1000
total_steps: 825000
runtime:
distribution_strategy: 'tpu'
student_model:
cls_heads: [{'activation': 'tanh',
'cls_token_idx': 0,
'dropout_rate': 0.0,
'inner_dim': 512,
'name': 'next_sentence',
'num_classes': 2}]
encoder:
mobilebert:
attention_probs_dropout_prob: 0.1
classifier_activation: false
hidden_activation: relu
hidden_dropout_prob: 0.0
hidden_size: 512
initializer_range: 0.02
input_mask_dtype: int32
intermediate_size: 512
intra_bottleneck_size: 128
key_query_shared_bottleneck: true
max_sequence_length: 512
normalization_type: no_norm
num_attention_heads: 4
num_blocks: 24
num_feedforward_networks: 4
type_vocab_size: 2
use_bottleneck_attention: false
word_embed_size: 128
word_vocab_size: 30522
type: mobilebert
mlm_activation: relu
mlm_initializer_range: 0.02
teacher_model:
cls_heads: []
encoder:
mobilebert:
attention_probs_dropout_prob: 0.1
classifier_activation: false
hidden_activation: gelu
hidden_dropout_prob: 0.1
hidden_size: 512
initializer_range: 0.02
input_mask_dtype: int32
intermediate_size: 4096
intra_bottleneck_size: 1024
key_query_shared_bottleneck: false
max_sequence_length: 512
normalization_type: layer_norm
num_attention_heads: 4
num_blocks: 24
num_feedforward_networks: 1
type_vocab_size: 2
use_bottleneck_attention: false
word_embed_size: 128
word_vocab_size: 30522
type: mobilebert
mlm_activation: gelu
mlm_initializer_range: 0.02
teacher_model_init_checkpoint: gs://**/uncased_L-24_H-1024_B-512_A-4_teacher/tf2_checkpoint/bert_model.ckpt-1
student_model_init_checkpoint: ''
train_datasest:
block_length: 1
cache: false
cycle_length: null
deterministic: null
drop_remainder: true
enable_tf_data_service: false
global_batch_size: 2048
input_path: gs://**/seq_512_mask_20/wikipedia.tfrecord*,gs://**/seq_512_mask_20/books.tfrecord*
is_training: true
max_predictions_per_seq: 20
seq_length: 512
sharding: true
shuffle_buffer_size: 100
tf_data_service_address: null
tf_data_service_job_name: null
tfds_as_supervised: false
tfds_data_dir: ''
tfds_name: ''
tfds_skip_decoding_feature: ''
tfds_split: ''
use_next_sentence_label: true
use_position_id: false
use_v2_feature_names: false
eval_dataset:
block_length: 1
cache: false
cycle_length: null
deterministic: null
drop_remainder: true
enable_tf_data_service: false
global_batch_size: 2048
input_path: gs://**/seq_512_mask_20/wikipedia.tfrecord-00141-of-00500,gs://**/seq_512_mask_20/books.tfrecord-00141-of-00500
is_training: false
max_predictions_per_seq: 20
seq_length: 512
sharding: true
shuffle_buffer_size: 100
tf_data_service_address: null
tf_data_service_job_name: null
tfds_as_supervised: false
tfds_data_dir: ''
tfds_name: ''
tfds_skip_decoding_feature: ''
tfds_split: ''
use_next_sentence_label: true
use_position_id: false
use_v2_feature_names: false
layer_wise_distillation:
num_steps: 20000
warmup_steps: 0
initial_learning_rate: 1.5e-3
end_learning_rate: 1.5e-3
decay_steps: 20000
end_to_end_distillation:
num_steps: 585000
warmup_steps: 20000
initial_learning_rate: 1.5e-3
end_learning_rate: 1.5e-7
decay_steps: 585000
distill_ground_truth_ratio: 0.5
optimizer:
optimizer:
lamb:
beta_1: 0.9
beta_2: 0.999
clipnorm: 1.0
epsilon: 1.0e-06
exclude_from_layer_adaptation: null
exclude_from_weight_decay: ['LayerNorm', 'bias', 'norm']
global_clipnorm: null
name: LAMB
weight_decay_rate: 0.01
type: lamb
orbit_config:
eval_interval: 1000
eval_steps: -1
mode: train
steps_per_loop: 1000
total_steps: 825000
runtime:
distribution_strategy: 'tpu'
student_model:
cls_heads: [{'activation': 'tanh',
'cls_token_idx': 0,
'dropout_rate': 0.0,
'inner_dim': 512,
'name': 'next_sentence',
'num_classes': 2}]
encoder:
mobilebert:
attention_probs_dropout_prob: 0.1
classifier_activation: false
hidden_activation: relu
hidden_dropout_prob: 0.0
hidden_size: 512
initializer_range: 0.02
input_mask_dtype: int32
intermediate_size: 1024
intra_bottleneck_size: 256
key_query_shared_bottleneck: true
max_sequence_length: 512
normalization_type: no_norm
num_attention_heads: 4
num_blocks: 12
num_feedforward_networks: 6
type_vocab_size: 2
use_bottleneck_attention: false
word_embed_size: 128
word_vocab_size: 30522
type: mobilebert
mlm_activation: relu
mlm_initializer_range: 0.02
teacher_model:
cls_heads: []
encoder:
mobilebert:
attention_probs_dropout_prob: 0.1
classifier_activation: false
hidden_activation: gelu
hidden_dropout_prob: 0.1
hidden_size: 512
initializer_range: 0.02
input_mask_dtype: int32
intermediate_size: 4096
intra_bottleneck_size: 1024
key_query_shared_bottleneck: false
max_sequence_length: 512
normalization_type: layer_norm
num_attention_heads: 4
num_blocks: 24
num_feedforward_networks: 1
type_vocab_size: 2
use_bottleneck_attention: false
word_embed_size: 128
word_vocab_size: 30522
type: mobilebert
mlm_activation: gelu
mlm_initializer_range: 0.02
teacher_model_init_checkpoint: gs://**/uncased_L-24_H-1024_B-512_A-4_teacher/tf2_checkpoint/bert_model.ckpt-1
student_model_init_checkpoint: ''
train_datasest:
block_length: 1
cache: false
cycle_length: null
deterministic: null
drop_remainder: true
enable_tf_data_service: false
global_batch_size: 2048
input_path: gs://**/seq_512_mask_20/wikipedia.tfrecord*,gs://**/seq_512_mask_20/books.tfrecord*
is_training: true
max_predictions_per_seq: 20
seq_length: 512
sharding: true
shuffle_buffer_size: 100
tf_data_service_address: null
tf_data_service_job_name: null
tfds_as_supervised: false
tfds_data_dir: ''
tfds_name: ''
tfds_skip_decoding_feature: ''
tfds_split: ''
use_next_sentence_label: true
use_position_id: false
use_v2_feature_names: false
eval_dataset:
block_length: 1
cache: false
cycle_length: null
deterministic: null
drop_remainder: true
enable_tf_data_service: false
global_batch_size: 2048
input_path: gs://**/seq_512_mask_20/wikipedia.tfrecord-00141-of-00500,gs://**/seq_512_mask_20/books.tfrecord-00141-of-00500
is_training: false
max_predictions_per_seq: 20
seq_length: 512
sharding: true
shuffle_buffer_size: 100
tf_data_service_address: null
tf_data_service_job_name: null
tfds_as_supervised: false
tfds_data_dir: ''
tfds_name: ''
tfds_skip_decoding_feature: ''
tfds_split: ''
use_next_sentence_label: true
use_position_id: false
use_v2_feature_names: false
layer_wise_distillation:
num_steps: 20000
warmup_steps: 0
initial_learning_rate: 1.5e-3
end_learning_rate: 1.5e-3
decay_steps: 20000
end_to_end_distillation:
num_steps: 585000
warmup_steps: 20000
initial_learning_rate: 1.5e-3
end_learning_rate: 1.5e-7
decay_steps: 585000
distill_ground_truth_ratio: 0.5
optimizer:
optimizer:
lamb:
beta_1: 0.9
beta_2: 0.999
clipnorm: 1.0
epsilon: 1.0e-06
exclude_from_layer_adaptation: null
exclude_from_weight_decay: ['LayerNorm', 'bias', 'norm']
global_clipnorm: null
name: LAMB
weight_decay_rate: 0.01
type: lamb
orbit_config:
eval_interval: 1000
eval_steps: -1
mode: train
steps_per_loop: 1000
total_steps: 825000
runtime:
distribution_strategy: 'tpu'
student_model:
cls_heads: [{'activation': 'tanh',
'cls_token_idx': 0,
'dropout_rate': 0.0,
'inner_dim': 512,
'name': 'next_sentence',
'num_classes': 2}]
encoder:
mobilebert:
attention_probs_dropout_prob: 0.1
classifier_activation: false
hidden_activation: relu
hidden_dropout_prob: 0.0
hidden_size: 512
initializer_range: 0.02
input_mask_dtype: int32
intermediate_size: 1024
intra_bottleneck_size: 256
key_query_shared_bottleneck: true
max_sequence_length: 512
normalization_type: no_norm
num_attention_heads: 4
num_blocks: 12
num_feedforward_networks: 4
type_vocab_size: 2
use_bottleneck_attention: false
word_embed_size: 128
word_vocab_size: 30522
type: mobilebert
mlm_activation: relu
mlm_initializer_range: 0.02
teacher_model:
cls_heads: []
encoder:
mobilebert:
attention_probs_dropout_prob: 0.1
classifier_activation: false
hidden_activation: gelu
hidden_dropout_prob: 0.1
hidden_size: 512
initializer_range: 0.02
input_mask_dtype: int32
intermediate_size: 4096
intra_bottleneck_size: 1024
key_query_shared_bottleneck: false
max_sequence_length: 512
normalization_type: layer_norm
num_attention_heads: 4
num_blocks: 24
num_feedforward_networks: 1
type_vocab_size: 2
use_bottleneck_attention: false
word_embed_size: 128
word_vocab_size: 30522
type: mobilebert
mlm_activation: gelu
mlm_initializer_range: 0.02
teacher_model_init_checkpoint: gs://**/uncased_L-24_H-1024_B-512_A-4_teacher/tf2_checkpoint/bert_model.ckpt-1
student_model_init_checkpoint: ''
train_datasest:
block_length: 1
cache: false
cycle_length: null
deterministic: null
drop_remainder: true
enable_tf_data_service: false
global_batch_size: 2048
input_path: gs://**/seq_512_mask_20/wikipedia.tfrecord*,gs://**/seq_512_mask_20/books.tfrecord*
is_training: true
max_predictions_per_seq: 20
seq_length: 512
sharding: true
shuffle_buffer_size: 100
tf_data_service_address: null
tf_data_service_job_name: null
tfds_as_supervised: false
tfds_data_dir: ''
tfds_name: ''
tfds_skip_decoding_feature: ''
tfds_split: ''
use_next_sentence_label: true
use_position_id: false
use_v2_feature_names: false
eval_dataset:
block_length: 1
cache: false
cycle_length: null
deterministic: null
drop_remainder: true
enable_tf_data_service: false
global_batch_size: 2048
input_path: gs://**/seq_512_mask_20/wikipedia.tfrecord-00141-of-00500,gs://**/seq_512_mask_20/books.tfrecord-00141-of-00500
is_training: false
max_predictions_per_seq: 20
seq_length: 512
sharding: true
shuffle_buffer_size: 100
tf_data_service_address: null
tf_data_service_job_name: null
tfds_as_supervised: false
tfds_data_dir: ''
tfds_name: ''
tfds_skip_decoding_feature: ''
tfds_split: ''
use_next_sentence_label: true
use_position_id: false
use_v2_feature_names: false
layer_wise_distillation:
num_steps: 30000
warmup_steps: 0
initial_learning_rate: 1.5e-3
end_learning_rate: 1.5e-3
decay_steps: 30000
end_to_end_distillation:
num_steps: 585000
warmup_steps: 20000
initial_learning_rate: 1.5e-3
end_learning_rate: 1.5e-7
decay_steps: 585000
distill_ground_truth_ratio: 0.5
optimizer:
optimizer:
lamb:
beta_1: 0.9
beta_2: 0.999
clipnorm: 1.0
epsilon: 1.0e-06
exclude_from_layer_adaptation: null
exclude_from_weight_decay: ['LayerNorm', 'bias', 'norm']
global_clipnorm: null
name: LAMB
weight_decay_rate: 0.01
type: lamb
orbit_config:
eval_interval: 1000
eval_steps: -1
mode: train
steps_per_loop: 1000
total_steps: 825000
runtime:
distribution_strategy: 'tpu'
student_model:
cls_heads: [{'activation': 'tanh',
'cls_token_idx': 0,
'dropout_rate': 0.0,
'inner_dim': 512,
'name': 'next_sentence',
'num_classes': 2}]
encoder:
mobilebert:
attention_probs_dropout_prob: 0.1
classifier_activation: false
hidden_activation: relu
hidden_dropout_prob: 0.0
hidden_size: 512
initializer_range: 0.02
input_mask_dtype: int32
intermediate_size: 1024
intra_bottleneck_size: 256
key_query_shared_bottleneck: true
max_sequence_length: 512
normalization_type: no_norm
num_attention_heads: 4
num_blocks: 8
num_feedforward_networks: 4
type_vocab_size: 2
use_bottleneck_attention: false
word_embed_size: 128
word_vocab_size: 30522
type: mobilebert
mlm_activation: relu
mlm_initializer_range: 0.02
teacher_model:
cls_heads: []
encoder:
mobilebert:
attention_probs_dropout_prob: 0.1
classifier_activation: false
hidden_activation: gelu
hidden_dropout_prob: 0.1
hidden_size: 512
initializer_range: 0.02
input_mask_dtype: int32
intermediate_size: 4096
intra_bottleneck_size: 1024
key_query_shared_bottleneck: false
max_sequence_length: 512
normalization_type: layer_norm
num_attention_heads: 4
num_blocks: 24
num_feedforward_networks: 1
type_vocab_size: 2
use_bottleneck_attention: false
word_embed_size: 128
word_vocab_size: 30522
type: mobilebert
mlm_activation: gelu
mlm_initializer_range: 0.02
teacher_model_init_checkpoint: gs://**/uncased_L-24_H-1024_B-512_A-4_teacher/tf2_checkpoint/bert_model.ckpt-1
student_model_init_checkpoint: ''
train_datasest:
block_length: 1
cache: false
cycle_length: null
deterministic: null
drop_remainder: true
enable_tf_data_service: false
global_batch_size: 2048
input_path: gs://**/seq_512_mask_20/wikipedia.tfrecord*,gs://**/seq_512_mask_20/books.tfrecord*
is_training: true
max_predictions_per_seq: 20
seq_length: 512
sharding: true
shuffle_buffer_size: 100
tf_data_service_address: null
tf_data_service_job_name: null
tfds_as_supervised: false
tfds_data_dir: ''
tfds_name: ''
tfds_skip_decoding_feature: ''
tfds_split: ''
use_next_sentence_label: true
use_position_id: false
use_v2_feature_names: false
eval_dataset:
block_length: 1
cache: false
cycle_length: null
deterministic: null
drop_remainder: true
enable_tf_data_service: false
global_batch_size: 2048
input_path: gs://**/seq_512_mask_20/wikipedia.tfrecord-00141-of-00500,gs://**/seq_512_mask_20/books.tfrecord-00141-of-00500
is_training: false
max_predictions_per_seq: 20
seq_length: 512
sharding: true
shuffle_buffer_size: 100
tf_data_service_address: null
tf_data_service_job_name: null
tfds_as_supervised: false
tfds_data_dir: ''
tfds_name: ''
tfds_skip_decoding_feature: ''
tfds_split: ''
use_next_sentence_label: true
use_position_id: false
use_v2_feature_names: false
# Copyright 2021 The TensorFlow Authors. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
"""Tests for mobilebert_edgetpu_trainer.py."""
import tensorflow as tf
from official.projects.edgetpu.nlp import mobilebert_edgetpu_trainer
from official.projects.edgetpu.nlp.configs import params
from official.projects.edgetpu.nlp.modeling import model_builder
# Helper function to create dummy dataset
def _dummy_dataset():
def dummy_data(_):
dummy_ids = tf.zeros((1, 64), dtype=tf.int32)
dummy_lm = tf.zeros((1, 64), dtype=tf.int32)
return dict(
input_word_ids=dummy_ids,
input_mask=dummy_ids,
input_type_ids=dummy_ids,
masked_lm_positions=dummy_lm,
masked_lm_ids=dummy_lm,
masked_lm_weights=tf.cast(dummy_lm, dtype=tf.float32),
next_sentence_labels=tf.zeros((1, 1), dtype=tf.int32))
dataset = tf.data.Dataset.range(1)
dataset = dataset.repeat()
dataset = dataset.map(
dummy_data, num_parallel_calls=tf.data.experimental.AUTOTUNE)
return dataset
class EdgetpuBertTrainerTest(tf.test.TestCase):
def setUp(self):
super(EdgetpuBertTrainerTest, self).setUp()
config_path = 'third_party/tensorflow_models/official/projects/edgetpu/nlp/experiments/mobilebert_edgetpu_m.yaml'
self.experiment_params = params.EdgeTPUBERTCustomParams.from_yaml(
config_path)
self.strategy = tf.distribute.get_strategy()
self.experiment_params.train_datasest.input_path = 'dummy'
self.experiment_params.eval_dataset.input_path = 'dummy'
def test_train_model_locally(self):
"""Tests training a model locally with one step."""
teacher_model = model_builder.build_bert_pretrainer(
pretrainer_cfg=self.experiment_params.teacher_model,
name='teacher')
_ = teacher_model(teacher_model.inputs)
student_model = model_builder.build_bert_pretrainer(
pretrainer_cfg=self.experiment_params.student_model,
name='student')
_ = student_model(student_model.inputs)
trainer = mobilebert_edgetpu_trainer.MobileBERTEdgeTPUDistillationTrainer(
teacher_model=teacher_model,
student_model=student_model,
strategy=self.strategy,
experiment_params=self.experiment_params)
# Rebuild dummy dataset since loading real dataset will cause timeout error.
trainer.train_dataset = _dummy_dataset()
trainer.eval_dataset = _dummy_dataset()
train_dataset_iter = iter(trainer.train_dataset)
eval_dataset_iter = iter(trainer.eval_dataset)
trainer.train_loop_begin()
trainer.train_step(train_dataset_iter)
trainer.eval_step(eval_dataset_iter)
if __name__ == '__main__':
tf.test.main()
# Copyright 2021 The TensorFlow Authors. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
"""Customized MobileBERT-EdgeTPU layers.
There are two reasons for us to customize the layers instead of using the well-
defined layers used in baseline MobileBERT.
1. The layer introduces compiler sharding failures. For example, the gather in
OnDeviceEmbedding.
2. The layer contains ops that need to have bounded input/output ranges. For
example, softmax op.
"""
import string
import numpy as np
import tensorflow as tf
from official.nlp.modeling import layers
_CHR_IDX = string.ascii_lowercase
# This function is directly copied from the tf.keras.layers.MultiHeadAttention
# implementation.
def _build_attention_equation(rank, attn_axes):
"""Builds einsum equations for the attention computation.
Query, key, value inputs after projection are expected to have the shape as:
`(bs, <non-attention dims>, <attention dims>, num_heads, channels)`.
`bs` and `<non-attention dims>` are treated as `<batch dims>`.
The attention operations can be generalized:
(1) Query-key dot product:
`(<batch dims>, <query attention dims>, num_heads, channels), (<batch dims>,
<key attention dims>, num_heads, channels) -> (<batch dims>,
num_heads, <query attention dims>, <key attention dims>)`
(2) Combination:
`(<batch dims>, num_heads, <query attention dims>, <key attention dims>),
(<batch dims>, <value attention dims>, num_heads, channels) -> (<batch dims>,
<query attention dims>, num_heads, channels)`
Args:
rank: Rank of query, key, value tensors.
attn_axes: List/tuple of axes, `[-1, rank)`,
that attention will be applied to.
Returns:
Einsum equations.
"""
target_notation = _CHR_IDX[:rank]
# `batch_dims` includes the head dim.
batch_dims = tuple(np.delete(range(rank), attn_axes + (rank - 1,)))
letter_offset = rank
source_notation = ''
for i in range(rank):
if i in batch_dims or i == rank - 1:
source_notation += target_notation[i]
else:
source_notation += _CHR_IDX[letter_offset]
letter_offset += 1
product_notation = ''.join([target_notation[i] for i in batch_dims] +
[target_notation[i] for i in attn_axes] +
[source_notation[i] for i in attn_axes])
dot_product_equation = '%s,%s->%s' % (source_notation, target_notation,
product_notation)
attn_scores_rank = len(product_notation)
combine_equation = '%s,%s->%s' % (product_notation, source_notation,
target_notation)
return dot_product_equation, combine_equation, attn_scores_rank
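# For reference, in the typical MultiHeadAttention case the projected query has
# shape (batch, seq_len, num_heads, channels), i.e. rank 4, and attention is
# applied over axis 1. The helper above then produces:
#   dot_product_equation = 'aecd,abcd->acbe'  # (key, query) -> attention scores
#   combine_equation     = 'acbe,aecd->abcd'  # (scores, value) -> attention output
#   attn_scores_rank     = 4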
@tf.keras.utils.register_keras_serializable(package='Text')
class EdgeTPUSoftmax(tf.keras.layers.Softmax):
"""EdgeTPU/Quantization friendly implementation for the SoftMax.
  When exporting a quantized model, use a mask value of -120. When exporting a
  float model and running inference with bf16 on device, use -10000.
"""
def __init__(self,
mask_value: int = -120,
**kwargs):
self._mask_value = mask_value
super(EdgeTPUSoftmax, self).__init__(**kwargs)
def get_config(self):
config = {
'mask_value': self._mask_value
}
base_config = super(EdgeTPUSoftmax, self).get_config()
return dict(list(base_config.items()) + list(config.items()))
def call(self, inputs, mask=None):
if mask is not None:
adder = (1.0 - tf.cast(mask, inputs.dtype)) * self._mask_value
inputs += adder
if isinstance(self.axis, (tuple, list)):
if len(self.axis) > 1:
return tf.exp(inputs - tf.reduce_logsumexp(
inputs, axis=self.axis, keepdims=True))
else:
return tf.keras.backend.softmax(inputs, axis=self.axis[0])
return tf.keras.backend.softmax(inputs, axis=self.axis)
@tf.keras.utils.register_keras_serializable(package='Text')
class EdgeTPUMultiHeadAttention(tf.keras.layers.MultiHeadAttention):
"""Quantization friendly implementation for the MultiHeadAttention."""
def _build_attention(self, rank):
"""Builds multi-head dot-product attention computations.
This function builds attributes necessary for `_compute_attention` to
    customize attention computation to replace the default dot-product
attention.
Args:
rank: the rank of query, key, value tensors.
"""
if self._attention_axes is None:
self._attention_axes = tuple(range(1, rank - 2))
else:
self._attention_axes = tuple(self._attention_axes)
self._dot_product_equation, self._combine_equation, attn_scores_rank = (
_build_attention_equation(
rank, attn_axes=self._attention_axes))
norm_axes = tuple(
range(attn_scores_rank - len(self._attention_axes), attn_scores_rank))
self._softmax = EdgeTPUSoftmax(axis=norm_axes)
self._dropout_layer = tf.keras.layers.Dropout(rate=self._dropout)
class EdgetpuMobileBertTransformer(layers.MobileBertTransformer):
"""Quantization friendly MobileBertTransformer.
  Inherits from MobileBertTransformer but uses our customized MHA.
"""
def __init__(self, **kwargs):
super(EdgetpuMobileBertTransformer, self).__init__(**kwargs)
attention_head_size = int(
self.intra_bottleneck_size / self.num_attention_heads)
attention_layer = EdgeTPUMultiHeadAttention(
num_heads=self.num_attention_heads,
key_dim=attention_head_size,
value_dim=attention_head_size,
dropout=self.attention_probs_dropout_prob,
output_shape=self.intra_bottleneck_size,
kernel_initializer=self.initializer,
name='attention')
layer_norm = self.block_layers['attention'][1]
self.block_layers['attention'] = [attention_layer, layer_norm]
# Copyright 2021 The TensorFlow Authors. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
"""Tests for custom layers used by MobileBERT-EdgeTPU."""
from absl.testing import parameterized
import numpy as np
import tensorflow as tf
from official.projects.edgetpu.nlp.modeling import edgetpu_layers
keras = tf.keras
class MultiHeadAttentionTest(tf.test.TestCase, parameterized.TestCase):
@parameterized.named_parameters(
("key_value_same_proj", None, None, [40, 80]),
("key_value_different_proj", 32, 60, [40, 60]),
)
def test_non_masked_attention(self, value_dim, output_shape, output_dims):
"""Test that the attention layer can be created without a mask tensor."""
test_layer = edgetpu_layers.EdgeTPUMultiHeadAttention(
num_heads=12,
key_dim=64,
value_dim=value_dim,
output_shape=output_shape)
# Create a 3-dimensional input (the first dimension is implicit).
query = keras.Input(shape=(40, 80))
value = keras.Input(shape=(20, 80))
output = test_layer(query=query, value=value)
self.assertEqual(output.shape.as_list(), [None] + output_dims)
def test_non_masked_self_attention(self):
"""Test with one input (self-attenntion) and no mask tensor."""
test_layer = edgetpu_layers.EdgeTPUMultiHeadAttention(
num_heads=12, key_dim=64)
# Create a 3-dimensional input (the first dimension is implicit).
query = keras.Input(shape=(40, 80))
output = test_layer(query, query)
self.assertEqual(output.shape.as_list(), [None, 40, 80])
def test_attention_scores(self):
"""Test attention outputs with coefficients."""
test_layer = edgetpu_layers.EdgeTPUMultiHeadAttention(
num_heads=12, key_dim=64)
# Create a 3-dimensional input (the first dimension is implicit).
query = keras.Input(shape=(40, 80))
output, coef = test_layer(query, query, return_attention_scores=True)
self.assertEqual(output.shape.as_list(), [None, 40, 80])
self.assertEqual(coef.shape.as_list(), [None, 12, 40, 40])
def test_attention_scores_with_values(self):
"""Test attention outputs with coefficients."""
test_layer = edgetpu_layers.EdgeTPUMultiHeadAttention(
num_heads=12, key_dim=64)
# Create a 3-dimensional input (the first dimension is implicit).
query = keras.Input(shape=(40, 80))
value = keras.Input(shape=(60, 80))
output, coef = test_layer(query, value, return_attention_scores=True)
self.assertEqual(output.shape.as_list(), [None, 40, 80])
self.assertEqual(coef.shape.as_list(), [None, 12, 40, 60])
@parameterized.named_parameters(("with_bias", True), ("no_bias", False))
def test_masked_attention(self, use_bias):
"""Test with a mask tensor."""
test_layer = edgetpu_layers.EdgeTPUMultiHeadAttention(
num_heads=2, key_dim=2, use_bias=use_bias)
# Create a 3-dimensional input (the first dimension is implicit).
batch_size = 3
query = keras.Input(shape=(4, 8))
value = keras.Input(shape=(2, 8))
mask_tensor = keras.Input(shape=(4, 2))
output = test_layer(query=query, value=value, attention_mask=mask_tensor)
# Create a model containing the test layer.
model = keras.Model([query, value, mask_tensor], output)
# Generate data for the input (non-mask) tensors.
from_data = 10 * np.random.random_sample((batch_size, 4, 8))
to_data = 10 * np.random.random_sample((batch_size, 2, 8))
# Invoke the data with a random set of mask data. This should mask at least
# one element.
mask_data = np.random.randint(2, size=(batch_size, 4, 2))
masked_output_data = model.predict([from_data, to_data, mask_data])
# Invoke the same data, but with a null mask (where no elements are masked).
null_mask_data = np.ones((batch_size, 4, 2))
unmasked_output_data = model.predict([from_data, to_data, null_mask_data])
# Because one data is masked and one is not, the outputs should not be the
# same.
self.assertNotAllClose(masked_output_data, unmasked_output_data)
# Tests the layer with three inputs: Q, K, V.
key = keras.Input(shape=(2, 8))
output = test_layer(query, value=value, key=key, attention_mask=mask_tensor)
model = keras.Model([query, value, key, mask_tensor], output)
masked_output_data = model.predict([from_data, to_data, to_data, mask_data])
unmasked_output_data = model.predict(
[from_data, to_data, to_data, null_mask_data])
# Because one data is masked and one is not, the outputs should not be the
# same.
self.assertNotAllClose(masked_output_data, unmasked_output_data)
if use_bias:
self.assertLen(test_layer._query_dense.trainable_variables, 2)
self.assertLen(test_layer._output_dense.trainable_variables, 2)
else:
self.assertLen(test_layer._query_dense.trainable_variables, 1)
self.assertLen(test_layer._output_dense.trainable_variables, 1)
def test_initializer(self):
"""Test with a specified initializer."""
test_layer = edgetpu_layers.EdgeTPUMultiHeadAttention(
num_heads=12,
key_dim=64,
kernel_initializer=keras.initializers.TruncatedNormal(stddev=0.02))
# Create a 3-dimensional input (the first dimension is implicit).
query = keras.Input(shape=(40, 80))
output = test_layer(query, query)
self.assertEqual(output.shape.as_list(), [None, 40, 80])
def test_masked_attention_with_scores(self):
"""Test with a mask tensor."""
test_layer = edgetpu_layers.EdgeTPUMultiHeadAttention(
num_heads=2, key_dim=2)
# Create a 3-dimensional input (the first dimension is implicit).
batch_size = 3
query = keras.Input(shape=(4, 8))
value = keras.Input(shape=(2, 8))
mask_tensor = keras.Input(shape=(4, 2))
output = test_layer(query=query, value=value, attention_mask=mask_tensor)
# Create a model containing the test layer.
model = keras.Model([query, value, mask_tensor], output)
# Generate data for the input (non-mask) tensors.
from_data = 10 * np.random.random_sample((batch_size, 4, 8))
to_data = 10 * np.random.random_sample((batch_size, 2, 8))
# Invoke the data with a random set of mask data. This should mask at least
# one element.
mask_data = np.random.randint(2, size=(batch_size, 4, 2))
masked_output_data = model.predict([from_data, to_data, mask_data])
# Invoke the same data, but with a null mask (where no elements are masked).
null_mask_data = np.ones((batch_size, 4, 2))
unmasked_output_data = model.predict([from_data, to_data, null_mask_data])
# Because one data is masked and one is not, the outputs should not be the
# same.
self.assertNotAllClose(masked_output_data, unmasked_output_data)
# Create a model containing attention scores.
output, scores = test_layer(
query=query, value=value, attention_mask=mask_tensor,
return_attention_scores=True)
model = keras.Model([query, value, mask_tensor], [output, scores])
masked_output_data_score, masked_score = model.predict(
[from_data, to_data, mask_data])
unmasked_output_data_score, unmasked_score = model.predict(
[from_data, to_data, null_mask_data])
self.assertNotAllClose(masked_output_data_score, unmasked_output_data_score)
self.assertAllClose(masked_output_data, masked_output_data_score)
self.assertAllClose(unmasked_output_data, unmasked_output_data_score)
self.assertNotAllClose(masked_score, unmasked_score)
@parameterized.named_parameters(
("4d_inputs_1freebatch_mask2", [3, 4], [3, 2], [4, 2],
(2,)), ("4d_inputs_1freebatch_mask3", [3, 4], [3, 2], [3, 4, 2], (2,)),
("4d_inputs_1freebatch_mask4", [3, 4], [3, 2], [3, 2, 4, 2],
(2,)), ("4D_inputs_2D_attention", [3, 4], [3, 2], [3, 4, 3, 2], (1, 2)),
("5D_inputs_2D_attention", [5, 3, 4], [5, 3, 2], [3, 4, 3, 2], (2, 3)),
("5D_inputs_2D_attention_fullmask", [5, 3, 4], [5, 3, 2], [5, 3, 4, 3, 2],
(2, 3)))
def test_high_dim_attention(self, q_dims, v_dims, mask_dims, attention_axes):
"""Test with a mask tensor."""
test_layer = edgetpu_layers.EdgeTPUMultiHeadAttention(
num_heads=2, key_dim=2, attention_axes=attention_axes)
batch_size, hidden_size = 3, 8
# Generate data for the input (non-mask) tensors.
query_shape = [batch_size] + q_dims + [hidden_size]
value_shape = [batch_size] + v_dims + [hidden_size]
mask_shape = [batch_size] + mask_dims
query = 10 * np.random.random_sample(query_shape)
value = 10 * np.random.random_sample(value_shape)
# Invoke the data with a random set of mask data. This should mask at least
# one element.
mask_data = np.random.randint(2, size=mask_shape).astype("bool")
# Invoke the same data, but with a null mask (where no elements are masked).
null_mask_data = np.ones(mask_shape)
# Because one data is masked and one is not, the outputs should not be the
# same.
query_tensor = keras.Input(query_shape[1:], name="query")
value_tensor = keras.Input(value_shape[1:], name="value")
mask_tensor = keras.Input(mask_shape[1:], name="mask")
output = test_layer(query=query_tensor, value=value_tensor,
attention_mask=mask_tensor)
model = keras.Model([query_tensor, value_tensor, mask_tensor], output)
self.assertNotAllClose(
model.predict([query, value, mask_data]),
model.predict([query, value, null_mask_data]))
def test_dropout(self):
test_layer = edgetpu_layers.EdgeTPUMultiHeadAttention(
num_heads=2, key_dim=2, dropout=0.5)
# Generate data for the input (non-mask) tensors.
from_data = keras.backend.ones(shape=(32, 4, 8))
to_data = keras.backend.ones(shape=(32, 2, 8))
train_out = test_layer(from_data, to_data, None, None, None, True)
test_out = test_layer(from_data, to_data, None, None, None, False)
# Output should be close when not in training mode,
# and should not be close when enabling dropout in training mode.
self.assertNotAllClose(
keras.backend.eval(train_out),
keras.backend.eval(test_out))
if __name__ == "__main__":
tf.test.main()
# Copyright 2021 The TensorFlow Authors. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
"""MobileBERT text encoder network."""
import tensorflow as tf
from official.nlp import modeling
from official.nlp.modeling import layers
from official.projects.edgetpu.nlp.modeling import edgetpu_layers
@tf.keras.utils.register_keras_serializable(package='Text')
class MobileBERTEncoder(tf.keras.Model):
"""A Keras functional API implementation for MobileBERT encoder."""
def __init__(self,
word_vocab_size=30522,
word_embed_size=128,
type_vocab_size=2,
max_sequence_length=512,
num_blocks=24,
hidden_size=512,
num_attention_heads=4,
intermediate_size=512,
intermediate_act_fn='relu',
hidden_dropout_prob=0.1,
attention_probs_dropout_prob=0.1,
intra_bottleneck_size=128,
initializer_range=0.02,
use_bottleneck_attention=False,
key_query_shared_bottleneck=True,
num_feedforward_networks=4,
normalization_type='no_norm',
classifier_activation=False,
input_mask_dtype='int32',
quantization_friendly=True,
**kwargs):
"""Class initialization.
Args:
word_vocab_size: Number of words in the vocabulary.
word_embed_size: Word embedding size.
type_vocab_size: Number of word types.
max_sequence_length: Maximum length of input sequence.
      num_blocks: Number of transformer blocks in the encoder model.
hidden_size: Hidden size for the transformer block.
num_attention_heads: Number of attention heads in the transformer block.
intermediate_size: The size of the "intermediate" (a.k.a., feed
forward) layer.
intermediate_act_fn: The non-linear activation function to apply
to the output of the intermediate/feed-forward layer.
hidden_dropout_prob: Dropout probability for the hidden layers.
attention_probs_dropout_prob: Dropout probability of the attention
probabilities.
intra_bottleneck_size: Size of bottleneck.
initializer_range: The stddev of the `truncated_normal_initializer` for
initializing all weight matrices.
use_bottleneck_attention: Use attention inputs from the bottleneck
transformation. If true, the following `key_query_shared_bottleneck`
will be ignored.
key_query_shared_bottleneck: Whether to share linear transformation for
keys and queries.
num_feedforward_networks: Number of stacked feed-forward networks.
normalization_type: The type of normalization_type, only `no_norm` and
`layer_norm` are supported. `no_norm` represents the element-wise linear
transformation for the student model, as suggested by the original
MobileBERT paper. `layer_norm` is used for the teacher model.
classifier_activation: If using the tanh activation for the final
representation of the `[CLS]` token in fine-tuning.
input_mask_dtype: The dtype of `input_mask` tensor, which is one of the
input tensors of this encoder. Defaults to `int32`. If you want
to use `tf.lite` quantization, which does not support `Cast` op,
please set this argument to `tf.float32` and feed `input_mask`
tensor with values in `float32` to avoid `tf.cast` in the computation.
      quantization_friendly: If enabled, the model uses the EdgeTPU mobile
        transformer. The difference is a customized softmax op which uses -120
        as the mask value, which is more stable for post-training quantization.
      **kwargs: Other keyword arguments.
"""
self._self_setattr_tracking = False
initializer = tf.keras.initializers.TruncatedNormal(
stddev=initializer_range)
# layer instantiation
self.embedding_layer = layers.MobileBertEmbedding(
word_vocab_size=word_vocab_size,
word_embed_size=word_embed_size,
type_vocab_size=type_vocab_size,
output_embed_size=hidden_size,
max_sequence_length=max_sequence_length,
normalization_type=normalization_type,
initializer=initializer,
dropout_rate=hidden_dropout_prob)
self._transformer_layers = []
transformer_layer_args = dict(
hidden_size=hidden_size,
num_attention_heads=num_attention_heads,
intermediate_size=intermediate_size,
intermediate_act_fn=intermediate_act_fn,
hidden_dropout_prob=hidden_dropout_prob,
attention_probs_dropout_prob=attention_probs_dropout_prob,
intra_bottleneck_size=intra_bottleneck_size,
use_bottleneck_attention=use_bottleneck_attention,
key_query_shared_bottleneck=key_query_shared_bottleneck,
num_feedforward_networks=num_feedforward_networks,
normalization_type=normalization_type,
initializer=initializer,
)
for layer_idx in range(num_blocks):
if quantization_friendly:
transformer = edgetpu_layers.EdgetpuMobileBertTransformer(
name=f'transformer_layer_{layer_idx}',
**transformer_layer_args)
else:
transformer = layers.MobileBertTransformer(
name=f'transformer_layer_{layer_idx}',
**transformer_layer_args)
self._transformer_layers.append(transformer)
# input tensor
input_ids = tf.keras.layers.Input(
shape=(None,), dtype=tf.int32, name='input_word_ids')
type_ids = tf.keras.layers.Input(
shape=(None,), dtype=tf.int32, name='input_type_ids')
input_mask = tf.keras.layers.Input(
shape=(None,), dtype=input_mask_dtype, name='input_mask')
self.inputs = [input_ids, input_mask, type_ids]
    # The dtype of `attention_mask` will be the same as the dtype of
    # `input_mask`.
attention_mask = modeling.layers.SelfAttentionMask()(input_mask, input_mask)
# build the computation graph
all_layer_outputs = []
all_attention_scores = []
embedding_output = self.embedding_layer(input_ids, type_ids)
all_layer_outputs.append(embedding_output)
prev_output = embedding_output
for layer_idx in range(num_blocks):
layer_output, attention_score = self._transformer_layers[layer_idx](
prev_output,
attention_mask,
return_attention_scores=True)
all_layer_outputs.append(layer_output)
all_attention_scores.append(attention_score)
prev_output = layer_output
first_token = tf.squeeze(prev_output[:, 0:1, :], axis=1)
if classifier_activation:
self._pooler_layer = tf.keras.layers.experimental.EinsumDense(
'ab,bc->ac',
output_shape=hidden_size,
activation=tf.tanh,
bias_axes='c',
kernel_initializer=initializer,
name='pooler')
first_token = self._pooler_layer(first_token)
else:
self._pooler_layer = None
outputs = dict(
sequence_output=prev_output,
pooled_output=first_token,
encoder_outputs=all_layer_outputs,
attention_scores=all_attention_scores)
super(MobileBERTEncoder, self).__init__(
inputs=self.inputs, outputs=outputs, **kwargs)
def get_embedding_table(self):
return self.embedding_layer.word_embedding.embeddings
def get_embedding_layer(self):
return self.embedding_layer.word_embedding
@property
def transformer_layers(self):
"""List of Transformer layers in the encoder."""
return self._transformer_layers
@property
def pooler_layer(self):
"""The pooler dense layer after the transformer layers."""
return self._pooler_layer
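# Example usage (illustrative sketch, not part of the library): build a small
# encoder and run it on dummy inputs. The shapes and hyperparameters below are
# arbitrary and chosen only to keep the example fast.
#
#   import tensorflow as tf
#
#   encoder = MobileBERTEncoder(num_blocks=2, quantization_friendly=True)
#   seq_len = 16
#   dummy_ids = tf.zeros((1, seq_len), dtype=tf.int32)
#   dummy_mask = tf.ones((1, seq_len), dtype=tf.int32)
#   outputs = encoder([dummy_ids, dummy_mask, dummy_ids])
#   print(outputs['sequence_output'].shape)  # (1, 16, 512)
#   print(outputs['pooled_output'].shape)    # (1, 512)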