Commit 9114f2a3 authored by A. Unique TensorFlower, committed by saberkun

Internal change

PiperOrigin-RevId: 404080616
parent ec0d7d0b
# MobileBERT-EdgeTPU
<figure align="center">
<img width=70% src=https://storage.googleapis.com/tf_model_garden/models/edgetpu/images/readme-mobilebert.png>
<figcaption>Performance of MobileBERT-EdgeTPU models on the SQuAD v1.1 dataset.</figcaption>
</figure>
Note: For the baseline MobileBERT float model, NNAPI delegates part of the
compute ops to the CPU, which makes the latency much higher.

Note: The accuracy numbers for BERT_base and BERT_large are from the original
[training results](https://arxiv.org/abs/1810.04805). These models are too
large to run on device.

Deploying low-latency, high-quality transformer-based language models on device
is highly desirable and can benefit multiple applications such as automatic
speech recognition (ASR), translation, sentence autocompletion, and even some
vision tasks. By co-designing the neural networks with the Edge TPU hardware
accelerator in the Google Tensor SoC, we have built EdgeTPU-customized
MobileBERT models that achieve datacenter-level model quality while
outperforming the baseline MobileBERT in latency.

We set up our model architecture search space based on
[MobileBERT](https://arxiv.org/abs/2004.02984) and leverage AutoML algorithms to
find models with up to 2x better hardware utilization. The higher utilization
lets us bring larger and more accurate models on chip while still outperforming
the baseline MobileBERT in latency. We built a customized distillation training
pipeline and performed an exhaustive hyperparameter search (e.g., learning rate,
dropout ratio) to achieve the best accuracy. As shown in the figure above, the
quantized MobileBERT-EdgeTPU models establish a new Pareto frontier for question
answering and also exceed the accuracy of the float BERT_base model, which is
400+ MB and too large to run on edge devices.

We also observed that, unlike most vision models, the accuracy of
MobileBERT/MobileBERT-EdgeTPU drops significantly with plain post-training
quantization (PTQ) or quantization-aware training (QAT). Proper model
modifications, such as clipping the attention mask value, are necessary to
retain the accuracy of a quantized model. Therefore, as an alternative to the
quantized models, we also provide a set of Edge TPU friendly float models that
produce a marginally better roofline than the baseline quantized MobileBERT.
Notably, the float MobileBERT-EdgeTPU-M model yields accuracy close to that of
BERT_large, which is 1.3 GB in float precision. Quantization thus becomes an
optional optimization rather than a prerequisite, which can unblock use cases
where quantization is infeasible or introduces large accuracy degradation, and
can potentially reduce time-to-market.
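
The mask-value clipping mentioned above is handled by the `EdgeTPUSoftmax`
layer added in this change (see `edgetpu_layers.py`): instead of the
conventional -10000 additive mask, masked logits are offset by a bounded -120,
which keeps the softmax input range friendly to 8-bit quantization. Below is a
minimal sketch of the layer's behavior; the shapes and values are purely
illustrative.

```python
import tensorflow as tf

from official.projects.edgetpu.nlp.modeling import edgetpu_layers

# Attention scores: (batch, num_heads, query_len, key_len).
scores = tf.random.normal([1, 4, 8, 8])
# Keep the first 6 key positions and mask the last 2 (1 = keep, 0 = mask).
mask = tf.reshape(tf.constant([1.0] * 6 + [0.0] * 2), [1, 1, 1, 8])

# Quantization-friendly softmax: masked logits are shifted by -120, not -10000.
softmax = edgetpu_layers.EdgeTPUSoftmax(mask_value=-120)
probs = softmax(scores, mask=mask)
print(probs.shape)  # (1, 4, 8, 8); probabilities at masked positions are ~0.
```
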
## Pre-trained Models
Model name | # Parameters | # Ops | MLM | Checkpoint | TFhub link
--------------------- | :----------: | :----: | :---: | :---: | :--------:
MobileBERT-EdgeTPU-M | 50.9M | 18.8e9 | 73.8% | WIP | WIP
MobileBERT-EdgeTPU-S | 38.3M | 14.0e9 | 72.8% | WIP | WIP
MobileBERT-EdgeTPU-XS | 27.1M | 9.4e9 | 71.2% | WIP | WIP
### Restoring from Checkpoints
To load the pre-trained MobileBERT checkpoint in your code, please follow the
example below or check the `serving/export_tflite_squad` module:
```python
import tensorflow as tf

from official.projects.edgetpu.nlp.configs import params
from official.projects.edgetpu.nlp.modeling import model_builder

bert_config_file = ...
model_checkpoint_path = ...
# Set up experiment params and load the configs from file/files.
experiment_params = params.EdgeTPUBERTCustomParams()
# change the input mask type to tf.float32 to avoid additional casting op.
experiment_params.student_model.encoder.mobilebert.input_mask_dtype = 'float32'
pretrainer_model = model_builder.build_bert_pretrainer(
experiment_params.student_model,
name='pretrainer',
quantization_friendly=True)
checkpoint_dict = {'model': pretrainer_model}
checkpoint = tf.train.Checkpoint(**checkpoint_dict)
checkpoint.restore(model_checkpoint_path).assert_existing_objects_matched()
```
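
The experiment parameters can also be loaded from one of the YAML files under
the project's `experiments/` directory via `EdgeTPUBERTCustomParams.from_yaml`,
as the unit tests do. A minimal sketch, where the local config path is
illustrative:

```python
from official.projects.edgetpu.nlp.configs import params

# Illustrative path to one of the experiment configs shipped with this project.
config_path = 'official/projects/edgetpu/nlp/experiments/mobilebert_edgetpu_m.yaml'
experiment_params = params.EdgeTPUBERTCustomParams.from_yaml(config_path)
print(experiment_params.student_model.encoder.mobilebert.num_blocks)  # 12 for the M variant.
```
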
### Use TF-Hub models
TODO(longy): Update with instructions to use tf-hub models
# Copyright 2021 The TensorFlow Authors. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
"""Datastructures for all the configurations for MobileBERT-EdgeTPU training."""
import dataclasses
from typing import Optional
from official.modeling import optimization
from official.modeling.hyperparams import base_config
from official.nlp.configs import bert
from official.nlp.data import pretrain_dataloader
DatasetParams = pretrain_dataloader.BertPretrainDataConfig
PretrainerModelParams = bert.PretrainerConfig
@dataclasses.dataclass
class OrbitParams(base_config.Config):
"""Parameters that setup Orbit training/evaluation pipeline.
Attributes:
mode: Orbit controller mode, can be 'train', 'train_and_evaluate', or
'evaluate'.
steps_per_loop: The number of steps to run in each inner loop of training.
total_steps: The global step count to train up to.
eval_steps: The number of steps to run during an evaluation. If -1, this
method will evaluate over the entire evaluation dataset.
eval_interval: The number of training steps to run between evaluations. If
set, training will always stop every `eval_interval` steps, even if this
results in a shorter inner loop than specified by `steps_per_loop`
setting. If None, evaluation will only be performed after training is
complete.
"""
mode: str = 'train'
steps_per_loop: int = 1000
total_steps: int = 1000000
eval_steps: int = -1
eval_interval: Optional[int] = None
@dataclasses.dataclass
class OptimizerParams(optimization.OptimizationConfig):
"""Optimizer parameters for MobileBERT-EdgeTPU."""
optimizer: optimization.OptimizerConfig = optimization.OptimizerConfig(
type='adamw',
adamw=optimization.AdamWeightDecayConfig(
weight_decay_rate=0.01,
exclude_from_weight_decay=['LayerNorm', 'layer_norm', 'bias']))
learning_rate: optimization.LrConfig = optimization.LrConfig(
type='polynomial',
polynomial=optimization.PolynomialLrConfig(
initial_learning_rate=1e-4,
decay_steps=1000000,
end_learning_rate=0.0))
warmup: optimization.WarmupConfig = optimization.WarmupConfig(
type='polynomial',
polynomial=optimization.PolynomialWarmupConfig(warmup_steps=10000))
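# Illustrative sketch (not part of this module): since OptimizerParams is an
# OptimizationConfig, it can be consumed by the standard optimizer factory in
# `official.modeling.optimization`. The exact call sites in the trainer may
# differ.
#
#   opt_config = OptimizerParams()
#   opt_factory = optimization.OptimizerFactory(opt_config)
#   learning_rate = opt_factory.build_learning_rate()
#   optimizer = opt_factory.build_optimizer(learning_rate)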
@dataclasses.dataclass
class RuntimeParams(base_config.Config):
"""Parameters that set up the training runtime.
TODO(longy): Can reuse the Runtime Config in:
official/core/config_definitions.py
  Attributes:
    distribution_strategy: Keras distribution strategy.
    use_gpu: Whether to use GPU.
    use_tpu: Whether to use TPU.
    num_gpus: Number of GPUs to use for training.
    all_reduce_alg: The all-reduce algorithm for multi-GPU training.
    num_workers: Number of parallel workers.
    tpu_address: The bns address of the TPU to use.
  """
distribution_strategy: str = 'off'
num_gpus: Optional[int] = 0
all_reduce_alg: Optional[str] = None
num_workers: int = 1
tpu_address: str = ''
use_gpu: Optional[bool] = None
use_tpu: Optional[bool] = None
@dataclasses.dataclass
class LayerWiseDistillationParams(base_config.Config):
"""Define the behavior of layer-wise distillation.
  Layer-wise distillation is an optional step in which knowledge is transferred
  layer by layer across all the transformer layers. The end-to-end distillation
  is performed after layer-wise distillation if the number of layer-wise
  distillation steps is not zero.
"""
num_steps: int = 10000
warmup_steps: int = 10000
initial_learning_rate: float = 1.5e-3
end_learning_rate: float = 1.5e-3
decay_steps: int = 10000
hidden_distill_factor: float = 100.0
beta_distill_factor: float = 5000.0
gamma_distill_factor: float = 5.0
attention_distill_factor: float = 1.0
@dataclasses.dataclass
class EndToEndDistillationParams(base_config.Config):
"""Define the behavior of end2end pretrainer distillation."""
num_steps: int = 580000
warmup_steps: int = 20000
initial_learning_rate: float = 1.5e-3
end_learning_rate: float = 1.5e-7
decay_steps: int = 580000
distill_ground_truth_ratio: float = 0.5
@dataclasses.dataclass
class EdgeTPUBERTCustomParams(base_config.Config):
"""EdgeTPU-BERT custom params.
Attributes:
train_dataset: An instance of the DatasetParams.
eval_dataset: An instance of the DatasetParams.
teacher_model: An instance of the PretrainerModelParams. If None, then the
student model is trained independently without distillation.
    student_model: An instance of the PretrainerModelParams.
teacher_model_init_checkpoint: Path for the teacher model init checkpoint.
student_model_init_checkpoint: Path for the student model init checkpoint.
layer_wise_distillation: Distillation config for the layer-wise step.
end_to_end_distillation: Distillation config for the end2end step.
optimizer: An instance of the OptimizerParams.
runtime: An instance of the RuntimeParams.
    orbit_config: An instance of the OrbitParams.
"""
train_datasest: DatasetParams = DatasetParams()
eval_dataset: DatasetParams = DatasetParams()
teacher_model: Optional[PretrainerModelParams] = PretrainerModelParams()
student_model: PretrainerModelParams = PretrainerModelParams()
teacher_model_init_checkpoint: str = ''
student_model_init_checkpoint: str = ''
layer_wise_distillation: LayerWiseDistillationParams = (
LayerWiseDistillationParams())
end_to_end_distillation: EndToEndDistillationParams = (
EndToEndDistillationParams())
optimizer: OptimizerParams = OptimizerParams()
runtime: RuntimeParams = RuntimeParams()
orbit_config: OrbitParams = OrbitParams()
task:
# hub_module_url: 'gs://**/panzf/mobilebert/tfhub/'
init_checkpoint: 'gs://**/edgetpu_bert/edgetpu_bert_float_candidate_13_e2e_820k/exported_ckpt/'
model:
num_classes: 3
metric_type: 'accuracy'
train_data:
drop_remainder: true
global_batch_size: 32
input_path: gs://**/yo/bert/glue/tfrecords/MNLI/MNLI_matched_train.tf_record
is_training: true
seq_length: 128
label_type: 'int'
validation_data:
drop_remainder: false
global_batch_size: 32
input_path: gs://**/yo/bert/glue/tfrecords/MNLI/MNLI_matched_eval.tf_record
is_training: false
seq_length: 128
label_type: 'int'
trainer:
checkpoint_interval: 10000
optimizer_config:
learning_rate:
polynomial:
# 100% of train_steps.
decay_steps: 50000
end_learning_rate: 0.0
initial_learning_rate: 3.0e-05
power: 1.0
type: polynomial
optimizer:
type: adamw
warmup:
polynomial:
power: 1
# ~10% of train_steps.
warmup_steps: 5000
type: polynomial
steps_per_loop: 1000
summary_interval: 1000
# Training data size 392,702 examples, 8 epochs.
train_steps: 50000
validation_interval: 2000
# Eval data size = 9815 examples.
validation_steps: 307
best_checkpoint_export_subdir: 'best_ckpt'
best_checkpoint_eval_metric: 'cls_accuracy'
best_checkpoint_metric_comp: 'higher'
# MobileBERT model from https://arxiv.org/abs/2004.02984.
task:
model:
encoder:
type: mobilebert
mobilebert:
word_vocab_size: 30522
word_embed_size: 128
type_vocab_size: 2
max_sequence_length: 512
num_blocks: 24
hidden_size: 512
num_attention_heads: 4
intermediate_size: 512
hidden_activation: relu
hidden_dropout_prob: 0.0
attention_probs_dropout_prob: 0.1
intra_bottleneck_size: 128
initializer_range: 0.02
key_query_shared_bottleneck: true
num_feedforward_networks: 4
normalization_type: no_norm
classifier_activation: false
# MobileBERT-EdgeTPU model.
task:
model:
encoder:
type: mobilebert
mobilebert:
word_vocab_size: 30522
word_embed_size: 128
type_vocab_size: 2
max_sequence_length: 512
num_blocks: 12
hidden_size: 512
num_attention_heads: 4
intermediate_size: 1024
hidden_activation: relu
hidden_dropout_prob: 0.1
attention_probs_dropout_prob: 0.1
intra_bottleneck_size: 256
initializer_range: 0.02
key_query_shared_bottleneck: true
num_feedforward_networks: 6
normalization_type: no_norm
classifier_activation: false
# MobileBERT-EdgeTPU-S model.
task:
model:
encoder:
type: mobilebert
mobilebert:
word_vocab_size: 30522
word_embed_size: 128
type_vocab_size: 2
max_sequence_length: 512
num_blocks: 12
hidden_size: 512
num_attention_heads: 4
intermediate_size: 1024
hidden_activation: relu
hidden_dropout_prob: 0.1
attention_probs_dropout_prob: 0.1
intra_bottleneck_size: 256
initializer_range: 0.02
key_query_shared_bottleneck: true
num_feedforward_networks: 4
normalization_type: no_norm
classifier_activation: false
# MobileBERT-EdgeTPU-XS model.
task:
model:
encoder:
type: mobilebert
mobilebert:
word_vocab_size: 30522
word_embed_size: 128
type_vocab_size: 2
max_sequence_length: 512
num_blocks: 8
hidden_size: 512
num_attention_heads: 4
intermediate_size: 1024
hidden_activation: relu
hidden_dropout_prob: 0.1
attention_probs_dropout_prob: 0.1
intra_bottleneck_size: 256
initializer_range: 0.02
key_query_shared_bottleneck: true
num_feedforward_networks: 4
normalization_type: no_norm
classifier_activation: false
task:
# hub_module_url: 'gs://**/panzf/mobilebert/tfhub/'
max_answer_length: 30
n_best_size: 20
null_score_diff_threshold: 0.0
init_checkpoint: 'gs://**/edgetpu_bert/edgetpu_bert_float_candidate_13_e2e_820k/exported_ckpt/'
train_data:
drop_remainder: true
global_batch_size: 32
input_path: gs://**/tp/bert/squad_v1.1/train.tf_record
is_training: true
seq_length: 384
validation_data:
do_lower_case: true
doc_stride: 128
drop_remainder: false
global_batch_size: 48
input_path: gs://**/squad/dev-v1.1.json
is_training: false
query_length: 64
seq_length: 384
tokenization: WordPiece
version_2_with_negative: false
vocab_file: gs://**/panzf/ttl-30d/mobilebert/tf2_checkpoint/vocab.txt
trainer:
checkpoint_interval: 1000
max_to_keep: 5
optimizer_config:
learning_rate:
polynomial:
decay_steps: 19420
end_learning_rate: 0.0
initial_learning_rate: 8.0e-05
power: 1.0
type: polynomial
optimizer:
type: adamw
warmup:
polynomial:
power: 1
# 10% of total training steps
warmup_steps: 1942
type: polynomial
steps_per_loop: 1000
summary_interval: 1000
# 7 epochs for training
train_steps: 19420
validation_interval: 3000
validation_steps: 226
best_checkpoint_export_subdir: 'best_ckpt'
best_checkpoint_eval_metric: 'final_f1'
best_checkpoint_metric_comp: 'higher'
# Distillation pretraining for Mobilebert.
# The final MLM accuracy is around 70.8% for e2e only training and 71.4% for layer-wise + e2e.
layer_wise_distillation:
num_steps: 10000
warmup_steps: 0
initial_learning_rate: 1.5e-3
end_learning_rate: 1.5e-3
decay_steps: 10000
end_to_end_distillation:
num_steps: 585000
warmup_steps: 20000
initial_learning_rate: 1.5e-3
end_learning_rate: 1.5e-7
decay_steps: 585000
distill_ground_truth_ratio: 0.5
optimizer:
optimizer:
lamb:
beta_1: 0.9
beta_2: 0.999
clipnorm: 1.0
epsilon: 1.0e-06
exclude_from_layer_adaptation: null
exclude_from_weight_decay: ['LayerNorm', 'bias', 'norm']
global_clipnorm: null
name: LAMB
weight_decay_rate: 0.01
type: lamb
orbit_config:
eval_interval: 1000
eval_steps: -1
mode: train
steps_per_loop: 1000
total_steps: 825000
runtime:
distribution_strategy: 'tpu'
student_model:
cls_heads: [{'activation': 'tanh',
'cls_token_idx': 0,
'dropout_rate': 0.0,
'inner_dim': 512,
'name': 'next_sentence',
'num_classes': 2}]
encoder:
mobilebert:
attention_probs_dropout_prob: 0.1
classifier_activation: false
hidden_activation: relu
hidden_dropout_prob: 0.0
hidden_size: 512
initializer_range: 0.02
input_mask_dtype: int32
intermediate_size: 512
intra_bottleneck_size: 128
key_query_shared_bottleneck: true
max_sequence_length: 512
normalization_type: no_norm
num_attention_heads: 4
num_blocks: 24
num_feedforward_networks: 4
type_vocab_size: 2
use_bottleneck_attention: false
word_embed_size: 128
word_vocab_size: 30522
type: mobilebert
mlm_activation: relu
mlm_initializer_range: 0.02
teacher_model:
cls_heads: []
encoder:
mobilebert:
attention_probs_dropout_prob: 0.1
classifier_activation: false
hidden_activation: gelu
hidden_dropout_prob: 0.1
hidden_size: 512
initializer_range: 0.02
input_mask_dtype: int32
intermediate_size: 4096
intra_bottleneck_size: 1024
key_query_shared_bottleneck: false
max_sequence_length: 512
normalization_type: layer_norm
num_attention_heads: 4
num_blocks: 24
num_feedforward_networks: 1
type_vocab_size: 2
use_bottleneck_attention: false
word_embed_size: 128
word_vocab_size: 30522
type: mobilebert
mlm_activation: gelu
mlm_initializer_range: 0.02
teacher_model_init_checkpoint: gs://**/uncased_L-24_H-1024_B-512_A-4_teacher/tf2_checkpoint/bert_model.ckpt-1
student_model_init_checkpoint: ''
train_datasest:
block_length: 1
cache: false
cycle_length: null
deterministic: null
drop_remainder: true
enable_tf_data_service: false
global_batch_size: 2048
input_path: gs://**/seq_512_mask_20/wikipedia.tfrecord*,gs://**/seq_512_mask_20/books.tfrecord*
is_training: true
max_predictions_per_seq: 20
seq_length: 512
sharding: true
shuffle_buffer_size: 100
tf_data_service_address: null
tf_data_service_job_name: null
tfds_as_supervised: false
tfds_data_dir: ''
tfds_name: ''
tfds_skip_decoding_feature: ''
tfds_split: ''
use_next_sentence_label: true
use_position_id: false
use_v2_feature_names: false
eval_dataset:
block_length: 1
cache: false
cycle_length: null
deterministic: null
drop_remainder: true
enable_tf_data_service: false
global_batch_size: 2048
input_path: gs://**/seq_512_mask_20/wikipedia.tfrecord-00141-of-00500,gs://**/seq_512_mask_20/books.tfrecord-00141-of-00500
is_training: false
max_predictions_per_seq: 20
seq_length: 512
sharding: true
shuffle_buffer_size: 100
tf_data_service_address: null
tf_data_service_job_name: null
tfds_as_supervised: false
tfds_data_dir: ''
tfds_name: ''
tfds_skip_decoding_feature: ''
tfds_split: ''
use_next_sentence_label: true
use_position_id: false
use_v2_feature_names: false
layer_wise_distillation:
num_steps: 20000
warmup_steps: 0
initial_learning_rate: 1.5e-3
end_learning_rate: 1.5e-3
decay_steps: 20000
end_to_end_distillation:
num_steps: 585000
warmup_steps: 20000
initial_learning_rate: 1.5e-3
end_learning_rate: 1.5e-7
decay_steps: 585000
distill_ground_truth_ratio: 0.5
optimizer:
optimizer:
lamb:
beta_1: 0.9
beta_2: 0.999
clipnorm: 1.0
epsilon: 1.0e-06
exclude_from_layer_adaptation: null
exclude_from_weight_decay: ['LayerNorm', 'bias', 'norm']
global_clipnorm: null
name: LAMB
weight_decay_rate: 0.01
type: lamb
orbit_config:
eval_interval: 1000
eval_steps: -1
mode: train
steps_per_loop: 1000
total_steps: 825000
runtime:
distribution_strategy: 'tpu'
student_model:
cls_heads: [{'activation': 'tanh',
'cls_token_idx': 0,
'dropout_rate': 0.0,
'inner_dim': 512,
'name': 'next_sentence',
'num_classes': 2}]
encoder:
mobilebert:
attention_probs_dropout_prob: 0.1
classifier_activation: false
hidden_activation: relu
hidden_dropout_prob: 0.0
hidden_size: 512
initializer_range: 0.02
input_mask_dtype: int32
intermediate_size: 1024
intra_bottleneck_size: 256
key_query_shared_bottleneck: true
max_sequence_length: 512
normalization_type: no_norm
num_attention_heads: 4
num_blocks: 12
num_feedforward_networks: 6
type_vocab_size: 2
use_bottleneck_attention: false
word_embed_size: 128
word_vocab_size: 30522
type: mobilebert
mlm_activation: relu
mlm_initializer_range: 0.02
teacher_model:
cls_heads: []
encoder:
mobilebert:
attention_probs_dropout_prob: 0.1
classifier_activation: false
hidden_activation: gelu
hidden_dropout_prob: 0.1
hidden_size: 512
initializer_range: 0.02
input_mask_dtype: int32
intermediate_size: 4096
intra_bottleneck_size: 1024
key_query_shared_bottleneck: false
max_sequence_length: 512
normalization_type: layer_norm
num_attention_heads: 4
num_blocks: 24
num_feedforward_networks: 1
type_vocab_size: 2
use_bottleneck_attention: false
word_embed_size: 128
word_vocab_size: 30522
type: mobilebert
mlm_activation: gelu
mlm_initializer_range: 0.02
teacher_model_init_checkpoint: gs://**/uncased_L-24_H-1024_B-512_A-4_teacher/tf2_checkpoint/bert_model.ckpt-1
student_model_init_checkpoint: ''
train_datasest:
block_length: 1
cache: false
cycle_length: null
deterministic: null
drop_remainder: true
enable_tf_data_service: false
global_batch_size: 2048
input_path: gs://**/seq_512_mask_20/wikipedia.tfrecord*,gs://**/seq_512_mask_20/books.tfrecord*
is_training: true
max_predictions_per_seq: 20
seq_length: 512
sharding: true
shuffle_buffer_size: 100
tf_data_service_address: null
tf_data_service_job_name: null
tfds_as_supervised: false
tfds_data_dir: ''
tfds_name: ''
tfds_skip_decoding_feature: ''
tfds_split: ''
use_next_sentence_label: true
use_position_id: false
use_v2_feature_names: false
eval_dataset:
block_length: 1
cache: false
cycle_length: null
deterministic: null
drop_remainder: true
enable_tf_data_service: false
global_batch_size: 2048
input_path: gs://**/seq_512_mask_20/wikipedia.tfrecord-00141-of-00500,gs://**/seq_512_mask_20/books.tfrecord-00141-of-00500
is_training: false
max_predictions_per_seq: 20
seq_length: 512
sharding: true
shuffle_buffer_size: 100
tf_data_service_address: null
tf_data_service_job_name: null
tfds_as_supervised: false
tfds_data_dir: ''
tfds_name: ''
tfds_skip_decoding_feature: ''
tfds_split: ''
use_next_sentence_label: true
use_position_id: false
use_v2_feature_names: false
layer_wise_distillation:
num_steps: 20000
warmup_steps: 0
initial_learning_rate: 1.5e-3
end_learning_rate: 1.5e-3
decay_steps: 20000
end_to_end_distillation:
num_steps: 585000
warmup_steps: 20000
initial_learning_rate: 1.5e-3
end_learning_rate: 1.5e-7
decay_steps: 585000
distill_ground_truth_ratio: 0.5
optimizer:
optimizer:
lamb:
beta_1: 0.9
beta_2: 0.999
clipnorm: 1.0
epsilon: 1.0e-06
exclude_from_layer_adaptation: null
exclude_from_weight_decay: ['LayerNorm', 'bias', 'norm']
global_clipnorm: null
name: LAMB
weight_decay_rate: 0.01
type: lamb
orbit_config:
eval_interval: 1000
eval_steps: -1
mode: train
steps_per_loop: 1000
total_steps: 825000
runtime:
distribution_strategy: 'tpu'
student_model:
cls_heads: [{'activation': 'tanh',
'cls_token_idx': 0,
'dropout_rate': 0.0,
'inner_dim': 512,
'name': 'next_sentence',
'num_classes': 2}]
encoder:
mobilebert:
attention_probs_dropout_prob: 0.1
classifier_activation: false
hidden_activation: relu
hidden_dropout_prob: 0.0
hidden_size: 512
initializer_range: 0.02
input_mask_dtype: int32
intermediate_size: 1024
intra_bottleneck_size: 256
key_query_shared_bottleneck: true
max_sequence_length: 512
normalization_type: no_norm
num_attention_heads: 4
num_blocks: 12
num_feedforward_networks: 4
type_vocab_size: 2
use_bottleneck_attention: false
word_embed_size: 128
word_vocab_size: 30522
type: mobilebert
mlm_activation: relu
mlm_initializer_range: 0.02
teacher_model:
cls_heads: []
encoder:
mobilebert:
attention_probs_dropout_prob: 0.1
classifier_activation: false
hidden_activation: gelu
hidden_dropout_prob: 0.1
hidden_size: 512
initializer_range: 0.02
input_mask_dtype: int32
intermediate_size: 4096
intra_bottleneck_size: 1024
key_query_shared_bottleneck: false
max_sequence_length: 512
normalization_type: layer_norm
num_attention_heads: 4
num_blocks: 24
num_feedforward_networks: 1
type_vocab_size: 2
use_bottleneck_attention: false
word_embed_size: 128
word_vocab_size: 30522
type: mobilebert
mlm_activation: gelu
mlm_initializer_range: 0.02
teacher_model_init_checkpoint: gs://**/uncased_L-24_H-1024_B-512_A-4_teacher/tf2_checkpoint/bert_model.ckpt-1
student_model_init_checkpoint: ''
train_datasest:
block_length: 1
cache: false
cycle_length: null
deterministic: null
drop_remainder: true
enable_tf_data_service: false
global_batch_size: 2048
input_path: gs://**/seq_512_mask_20/wikipedia.tfrecord*,gs://**/seq_512_mask_20/books.tfrecord*
is_training: true
max_predictions_per_seq: 20
seq_length: 512
sharding: true
shuffle_buffer_size: 100
tf_data_service_address: null
tf_data_service_job_name: null
tfds_as_supervised: false
tfds_data_dir: ''
tfds_name: ''
tfds_skip_decoding_feature: ''
tfds_split: ''
use_next_sentence_label: true
use_position_id: false
use_v2_feature_names: false
eval_dataset:
block_length: 1
cache: false
cycle_length: null
deterministic: null
drop_remainder: true
enable_tf_data_service: false
global_batch_size: 2048
input_path: gs://**/seq_512_mask_20/wikipedia.tfrecord-00141-of-00500,gs://**/seq_512_mask_20/books.tfrecord-00141-of-00500
is_training: false
max_predictions_per_seq: 20
seq_length: 512
sharding: true
shuffle_buffer_size: 100
tf_data_service_address: null
tf_data_service_job_name: null
tfds_as_supervised: false
tfds_data_dir: ''
tfds_name: ''
tfds_skip_decoding_feature: ''
tfds_split: ''
use_next_sentence_label: true
use_position_id: false
use_v2_feature_names: false
layer_wise_distillation:
num_steps: 30000
warmup_steps: 0
initial_learning_rate: 1.5e-3
end_learning_rate: 1.5e-3
decay_steps: 30000
end_to_end_distillation:
num_steps: 585000
warmup_steps: 20000
initial_learning_rate: 1.5e-3
end_learning_rate: 1.5e-7
decay_steps: 585000
distill_ground_truth_ratio: 0.5
optimizer:
optimizer:
lamb:
beta_1: 0.9
beta_2: 0.999
clipnorm: 1.0
epsilon: 1.0e-06
exclude_from_layer_adaptation: null
exclude_from_weight_decay: ['LayerNorm', 'bias', 'norm']
global_clipnorm: null
name: LAMB
weight_decay_rate: 0.01
type: lamb
orbit_config:
eval_interval: 1000
eval_steps: -1
mode: train
steps_per_loop: 1000
total_steps: 825000
runtime:
distribution_strategy: 'tpu'
student_model:
cls_heads: [{'activation': 'tanh',
'cls_token_idx': 0,
'dropout_rate': 0.0,
'inner_dim': 512,
'name': 'next_sentence',
'num_classes': 2}]
encoder:
mobilebert:
attention_probs_dropout_prob: 0.1
classifier_activation: false
hidden_activation: relu
hidden_dropout_prob: 0.0
hidden_size: 512
initializer_range: 0.02
input_mask_dtype: int32
intermediate_size: 1024
intra_bottleneck_size: 256
key_query_shared_bottleneck: true
max_sequence_length: 512
normalization_type: no_norm
num_attention_heads: 4
num_blocks: 8
num_feedforward_networks: 4
type_vocab_size: 2
use_bottleneck_attention: false
word_embed_size: 128
word_vocab_size: 30522
type: mobilebert
mlm_activation: relu
mlm_initializer_range: 0.02
teacher_model:
cls_heads: []
encoder:
mobilebert:
attention_probs_dropout_prob: 0.1
classifier_activation: false
hidden_activation: gelu
hidden_dropout_prob: 0.1
hidden_size: 512
initializer_range: 0.02
input_mask_dtype: int32
intermediate_size: 4096
intra_bottleneck_size: 1024
key_query_shared_bottleneck: false
max_sequence_length: 512
normalization_type: layer_norm
num_attention_heads: 4
num_blocks: 24
num_feedforward_networks: 1
type_vocab_size: 2
use_bottleneck_attention: false
word_embed_size: 128
word_vocab_size: 30522
type: mobilebert
mlm_activation: gelu
mlm_initializer_range: 0.02
teacher_model_init_checkpoint: gs://**/uncased_L-24_H-1024_B-512_A-4_teacher/tf2_checkpoint/bert_model.ckpt-1
student_model_init_checkpoint: ''
train_datasest:
block_length: 1
cache: false
cycle_length: null
deterministic: null
drop_remainder: true
enable_tf_data_service: false
global_batch_size: 2048
input_path: gs://**/seq_512_mask_20/wikipedia.tfrecord*,gs://**/seq_512_mask_20/books.tfrecord*
is_training: true
max_predictions_per_seq: 20
seq_length: 512
sharding: true
shuffle_buffer_size: 100
tf_data_service_address: null
tf_data_service_job_name: null
tfds_as_supervised: false
tfds_data_dir: ''
tfds_name: ''
tfds_skip_decoding_feature: ''
tfds_split: ''
use_next_sentence_label: true
use_position_id: false
use_v2_feature_names: false
eval_dataset:
block_length: 1
cache: false
cycle_length: null
deterministic: null
drop_remainder: true
enable_tf_data_service: false
global_batch_size: 2048
input_path: gs://**/seq_512_mask_20/wikipedia.tfrecord-00141-of-00500,gs://**/seq_512_mask_20/books.tfrecord-00141-of-00500
is_training: false
max_predictions_per_seq: 20
seq_length: 512
sharding: true
shuffle_buffer_size: 100
tf_data_service_address: null
tf_data_service_job_name: null
tfds_as_supervised: false
tfds_data_dir: ''
tfds_name: ''
tfds_skip_decoding_feature: ''
tfds_split: ''
use_next_sentence_label: true
use_position_id: false
use_v2_feature_names: false
# Copyright 2021 The TensorFlow Authors. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
"""Tests for mobilebert_edgetpu_trainer.py."""
import tensorflow as tf
from official.projects.edgetpu.nlp import mobilebert_edgetpu_trainer
from official.projects.edgetpu.nlp.configs import params
from official.projects.edgetpu.nlp.modeling import model_builder
# Helper function to create dummy dataset
def _dummy_dataset():
def dummy_data(_):
dummy_ids = tf.zeros((1, 64), dtype=tf.int32)
dummy_lm = tf.zeros((1, 64), dtype=tf.int32)
return dict(
input_word_ids=dummy_ids,
input_mask=dummy_ids,
input_type_ids=dummy_ids,
masked_lm_positions=dummy_lm,
masked_lm_ids=dummy_lm,
masked_lm_weights=tf.cast(dummy_lm, dtype=tf.float32),
next_sentence_labels=tf.zeros((1, 1), dtype=tf.int32))
dataset = tf.data.Dataset.range(1)
dataset = dataset.repeat()
dataset = dataset.map(
dummy_data, num_parallel_calls=tf.data.experimental.AUTOTUNE)
return dataset
class EdgetpuBertTrainerTest(tf.test.TestCase):
def setUp(self):
super(EdgetpuBertTrainerTest, self).setUp()
config_path = 'third_party/tensorflow_models/official/projects/edgetpu/nlp/experiments/mobilebert_edgetpu_m.yaml'
self.experiment_params = params.EdgeTPUBERTCustomParams.from_yaml(
config_path)
self.strategy = tf.distribute.get_strategy()
self.experiment_params.train_datasest.input_path = 'dummy'
self.experiment_params.eval_dataset.input_path = 'dummy'
def test_train_model_locally(self):
"""Tests training a model locally with one step."""
teacher_model = model_builder.build_bert_pretrainer(
pretrainer_cfg=self.experiment_params.teacher_model,
name='teacher')
_ = teacher_model(teacher_model.inputs)
student_model = model_builder.build_bert_pretrainer(
pretrainer_cfg=self.experiment_params.student_model,
name='student')
_ = student_model(student_model.inputs)
trainer = mobilebert_edgetpu_trainer.MobileBERTEdgeTPUDistillationTrainer(
teacher_model=teacher_model,
student_model=student_model,
strategy=self.strategy,
experiment_params=self.experiment_params)
# Rebuild dummy dataset since loading real dataset will cause timeout error.
trainer.train_dataset = _dummy_dataset()
trainer.eval_dataset = _dummy_dataset()
train_dataset_iter = iter(trainer.train_dataset)
eval_dataset_iter = iter(trainer.eval_dataset)
trainer.train_loop_begin()
trainer.train_step(train_dataset_iter)
trainer.eval_step(eval_dataset_iter)
if __name__ == '__main__':
tf.test.main()
# Copyright 2021 The TensorFlow Authors. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
"""Customized MobileBERT-EdgeTPU layers.
There are two reasons for us to customize the layers instead of using the well-
defined layers used in baseline MobileBERT.
1. The layer introduces compiler sharding failures. For example, the gather in
OnDeviceEmbedding.
2. The layer contains ops that need to have bounded input/output ranges. For
example, softmax op.
"""
import string
import numpy as np
import tensorflow as tf
from official.nlp.modeling import layers
_CHR_IDX = string.ascii_lowercase
# This function is directly copied from the tf.keras.layers.MultiHeadAttention
# implementation.
def _build_attention_equation(rank, attn_axes):
"""Builds einsum equations for the attention computation.
Query, key, value inputs after projection are expected to have the shape as:
`(bs, <non-attention dims>, <attention dims>, num_heads, channels)`.
`bs` and `<non-attention dims>` are treated as `<batch dims>`.
The attention operations can be generalized:
(1) Query-key dot product:
`(<batch dims>, <query attention dims>, num_heads, channels), (<batch dims>,
<key attention dims>, num_heads, channels) -> (<batch dims>,
num_heads, <query attention dims>, <key attention dims>)`
(2) Combination:
`(<batch dims>, num_heads, <query attention dims>, <key attention dims>),
(<batch dims>, <value attention dims>, num_heads, channels) -> (<batch dims>,
<query attention dims>, num_heads, channels)`
Args:
rank: Rank of query, key, value tensors.
attn_axes: List/tuple of axes, `[-1, rank)`,
that attention will be applied to.
Returns:
Einsum equations.
"""
target_notation = _CHR_IDX[:rank]
# `batch_dims` includes the head dim.
batch_dims = tuple(np.delete(range(rank), attn_axes + (rank - 1,)))
letter_offset = rank
source_notation = ''
for i in range(rank):
if i in batch_dims or i == rank - 1:
source_notation += target_notation[i]
else:
source_notation += _CHR_IDX[letter_offset]
letter_offset += 1
product_notation = ''.join([target_notation[i] for i in batch_dims] +
[target_notation[i] for i in attn_axes] +
[source_notation[i] for i in attn_axes])
dot_product_equation = '%s,%s->%s' % (source_notation, target_notation,
product_notation)
attn_scores_rank = len(product_notation)
combine_equation = '%s,%s->%s' % (product_notation, source_notation,
target_notation)
return dot_product_equation, combine_equation, attn_scores_rank
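# For reference, in the typical MultiHeadAttention case the projected query has
# shape (batch, seq_len, num_heads, channels), i.e. rank 4, and attention is
# applied over axis 1. The helper above then produces:
#   dot_product_equation = 'aecd,abcd->acbe'  # (key, query) -> attention scores
#   combine_equation     = 'acbe,aecd->abcd'  # (scores, value) -> attention output
#   attn_scores_rank     = 4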
@tf.keras.utils.register_keras_serializable(package='Text')
class EdgeTPUSoftmax(tf.keras.layers.Softmax):
"""EdgeTPU/Quantization friendly implementation for the SoftMax.
  When exporting a quantized model, use a mask value of -120. When exporting a
  float model and running inference with bf16 on device, use -10000.
"""
def __init__(self,
mask_value: int = -120,
**kwargs):
self._mask_value = mask_value
super(EdgeTPUSoftmax, self).__init__(**kwargs)
def get_config(self):
config = {
'mask_value': self._mask_value
}
base_config = super(EdgeTPUSoftmax, self).get_config()
return dict(list(base_config.items()) + list(config.items()))
def call(self, inputs, mask=None):
if mask is not None:
adder = (1.0 - tf.cast(mask, inputs.dtype)) * self._mask_value
inputs += adder
if isinstance(self.axis, (tuple, list)):
if len(self.axis) > 1:
return tf.exp(inputs - tf.reduce_logsumexp(
inputs, axis=self.axis, keepdims=True))
else:
return tf.keras.backend.softmax(inputs, axis=self.axis[0])
return tf.keras.backend.softmax(inputs, axis=self.axis)
@tf.keras.utils.register_keras_serializable(package='Text')
class EdgeTPUMultiHeadAttention(tf.keras.layers.MultiHeadAttention):
"""Quantization friendly implementation for the MultiHeadAttention."""
def _build_attention(self, rank):
"""Builds multi-head dot-product attention computations.
This function builds attributes necessary for `_compute_attention` to
    customize attention computation to replace the default dot-product
attention.
Args:
rank: the rank of query, key, value tensors.
"""
if self._attention_axes is None:
self._attention_axes = tuple(range(1, rank - 2))
else:
self._attention_axes = tuple(self._attention_axes)
self._dot_product_equation, self._combine_equation, attn_scores_rank = (
_build_attention_equation(
rank, attn_axes=self._attention_axes))
norm_axes = tuple(
range(attn_scores_rank - len(self._attention_axes), attn_scores_rank))
self._softmax = EdgeTPUSoftmax(axis=norm_axes)
self._dropout_layer = tf.keras.layers.Dropout(rate=self._dropout)
class EdgetpuMobileBertTransformer(layers.MobileBertTransformer):
"""Quantization friendly MobileBertTransformer.
  Inherits from MobileBertTransformer but uses our customized MHA.
"""
def __init__(self, **kwargs):
super(EdgetpuMobileBertTransformer, self).__init__(**kwargs)
attention_head_size = int(
self.intra_bottleneck_size / self.num_attention_heads)
attention_layer = EdgeTPUMultiHeadAttention(
num_heads=self.num_attention_heads,
key_dim=attention_head_size,
value_dim=attention_head_size,
dropout=self.attention_probs_dropout_prob,
output_shape=self.intra_bottleneck_size,
kernel_initializer=self.initializer,
name='attention')
layer_norm = self.block_layers['attention'][1]
self.block_layers['attention'] = [attention_layer, layer_norm]
# Copyright 2021 The TensorFlow Authors. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
"""Tests for custom layers used by MobileBERT-EdgeTPU."""
from absl.testing import parameterized
import numpy as np
import tensorflow as tf
from official.projects.edgetpu.nlp.modeling import edgetpu_layers
keras = tf.keras
class MultiHeadAttentionTest(tf.test.TestCase, parameterized.TestCase):
@parameterized.named_parameters(
("key_value_same_proj", None, None, [40, 80]),
("key_value_different_proj", 32, 60, [40, 60]),
)
def test_non_masked_attention(self, value_dim, output_shape, output_dims):
"""Test that the attention layer can be created without a mask tensor."""
test_layer = edgetpu_layers.EdgeTPUMultiHeadAttention(
num_heads=12,
key_dim=64,
value_dim=value_dim,
output_shape=output_shape)
# Create a 3-dimensional input (the first dimension is implicit).
query = keras.Input(shape=(40, 80))
value = keras.Input(shape=(20, 80))
output = test_layer(query=query, value=value)
self.assertEqual(output.shape.as_list(), [None] + output_dims)
def test_non_masked_self_attention(self):
"""Test with one input (self-attenntion) and no mask tensor."""
test_layer = edgetpu_layers.EdgeTPUMultiHeadAttention(
num_heads=12, key_dim=64)
# Create a 3-dimensional input (the first dimension is implicit).
query = keras.Input(shape=(40, 80))
output = test_layer(query, query)
self.assertEqual(output.shape.as_list(), [None, 40, 80])
def test_attention_scores(self):
"""Test attention outputs with coefficients."""
test_layer = edgetpu_layers.EdgeTPUMultiHeadAttention(
num_heads=12, key_dim=64)
# Create a 3-dimensional input (the first dimension is implicit).
query = keras.Input(shape=(40, 80))
output, coef = test_layer(query, query, return_attention_scores=True)
self.assertEqual(output.shape.as_list(), [None, 40, 80])
self.assertEqual(coef.shape.as_list(), [None, 12, 40, 40])
def test_attention_scores_with_values(self):
"""Test attention outputs with coefficients."""
test_layer = edgetpu_layers.EdgeTPUMultiHeadAttention(
num_heads=12, key_dim=64)
# Create a 3-dimensional input (the first dimension is implicit).
query = keras.Input(shape=(40, 80))
value = keras.Input(shape=(60, 80))
output, coef = test_layer(query, value, return_attention_scores=True)
self.assertEqual(output.shape.as_list(), [None, 40, 80])
self.assertEqual(coef.shape.as_list(), [None, 12, 40, 60])
@parameterized.named_parameters(("with_bias", True), ("no_bias", False))
def test_masked_attention(self, use_bias):
"""Test with a mask tensor."""
test_layer = edgetpu_layers.EdgeTPUMultiHeadAttention(
num_heads=2, key_dim=2, use_bias=use_bias)
# Create a 3-dimensional input (the first dimension is implicit).
batch_size = 3
query = keras.Input(shape=(4, 8))
value = keras.Input(shape=(2, 8))
mask_tensor = keras.Input(shape=(4, 2))
output = test_layer(query=query, value=value, attention_mask=mask_tensor)
# Create a model containing the test layer.
model = keras.Model([query, value, mask_tensor], output)
# Generate data for the input (non-mask) tensors.
from_data = 10 * np.random.random_sample((batch_size, 4, 8))
to_data = 10 * np.random.random_sample((batch_size, 2, 8))
# Invoke the data with a random set of mask data. This should mask at least
# one element.
mask_data = np.random.randint(2, size=(batch_size, 4, 2))
masked_output_data = model.predict([from_data, to_data, mask_data])
# Invoke the same data, but with a null mask (where no elements are masked).
null_mask_data = np.ones((batch_size, 4, 2))
unmasked_output_data = model.predict([from_data, to_data, null_mask_data])
# Because one data is masked and one is not, the outputs should not be the
# same.
self.assertNotAllClose(masked_output_data, unmasked_output_data)
# Tests the layer with three inputs: Q, K, V.
key = keras.Input(shape=(2, 8))
output = test_layer(query, value=value, key=key, attention_mask=mask_tensor)
model = keras.Model([query, value, key, mask_tensor], output)
masked_output_data = model.predict([from_data, to_data, to_data, mask_data])
unmasked_output_data = model.predict(
[from_data, to_data, to_data, null_mask_data])
# Because one data is masked and one is not, the outputs should not be the
# same.
self.assertNotAllClose(masked_output_data, unmasked_output_data)
if use_bias:
self.assertLen(test_layer._query_dense.trainable_variables, 2)
self.assertLen(test_layer._output_dense.trainable_variables, 2)
else:
self.assertLen(test_layer._query_dense.trainable_variables, 1)
self.assertLen(test_layer._output_dense.trainable_variables, 1)
def test_initializer(self):
"""Test with a specified initializer."""
test_layer = edgetpu_layers.EdgeTPUMultiHeadAttention(
num_heads=12,
key_dim=64,
kernel_initializer=keras.initializers.TruncatedNormal(stddev=0.02))
# Create a 3-dimensional input (the first dimension is implicit).
query = keras.Input(shape=(40, 80))
output = test_layer(query, query)
self.assertEqual(output.shape.as_list(), [None, 40, 80])
def test_masked_attention_with_scores(self):
"""Test with a mask tensor."""
test_layer = edgetpu_layers.EdgeTPUMultiHeadAttention(
num_heads=2, key_dim=2)
# Create a 3-dimensional input (the first dimension is implicit).
batch_size = 3
query = keras.Input(shape=(4, 8))
value = keras.Input(shape=(2, 8))
mask_tensor = keras.Input(shape=(4, 2))
output = test_layer(query=query, value=value, attention_mask=mask_tensor)
# Create a model containing the test layer.
model = keras.Model([query, value, mask_tensor], output)
# Generate data for the input (non-mask) tensors.
from_data = 10 * np.random.random_sample((batch_size, 4, 8))
to_data = 10 * np.random.random_sample((batch_size, 2, 8))
# Invoke the data with a random set of mask data. This should mask at least
# one element.
mask_data = np.random.randint(2, size=(batch_size, 4, 2))
masked_output_data = model.predict([from_data, to_data, mask_data])
# Invoke the same data, but with a null mask (where no elements are masked).
null_mask_data = np.ones((batch_size, 4, 2))
unmasked_output_data = model.predict([from_data, to_data, null_mask_data])
# Because one data is masked and one is not, the outputs should not be the
# same.
self.assertNotAllClose(masked_output_data, unmasked_output_data)
# Create a model containing attention scores.
output, scores = test_layer(
query=query, value=value, attention_mask=mask_tensor,
return_attention_scores=True)
model = keras.Model([query, value, mask_tensor], [output, scores])
masked_output_data_score, masked_score = model.predict(
[from_data, to_data, mask_data])
unmasked_output_data_score, unmasked_score = model.predict(
[from_data, to_data, null_mask_data])
self.assertNotAllClose(masked_output_data_score, unmasked_output_data_score)
self.assertAllClose(masked_output_data, masked_output_data_score)
self.assertAllClose(unmasked_output_data, unmasked_output_data_score)
self.assertNotAllClose(masked_score, unmasked_score)
@parameterized.named_parameters(
("4d_inputs_1freebatch_mask2", [3, 4], [3, 2], [4, 2],
(2,)), ("4d_inputs_1freebatch_mask3", [3, 4], [3, 2], [3, 4, 2], (2,)),
("4d_inputs_1freebatch_mask4", [3, 4], [3, 2], [3, 2, 4, 2],
(2,)), ("4D_inputs_2D_attention", [3, 4], [3, 2], [3, 4, 3, 2], (1, 2)),
("5D_inputs_2D_attention", [5, 3, 4], [5, 3, 2], [3, 4, 3, 2], (2, 3)),
("5D_inputs_2D_attention_fullmask", [5, 3, 4], [5, 3, 2], [5, 3, 4, 3, 2],
(2, 3)))
def test_high_dim_attention(self, q_dims, v_dims, mask_dims, attention_axes):
"""Test with a mask tensor."""
test_layer = edgetpu_layers.EdgeTPUMultiHeadAttention(
num_heads=2, key_dim=2, attention_axes=attention_axes)
batch_size, hidden_size = 3, 8
# Generate data for the input (non-mask) tensors.
query_shape = [batch_size] + q_dims + [hidden_size]
value_shape = [batch_size] + v_dims + [hidden_size]
mask_shape = [batch_size] + mask_dims
query = 10 * np.random.random_sample(query_shape)
value = 10 * np.random.random_sample(value_shape)
# Invoke the data with a random set of mask data. This should mask at least
# one element.
mask_data = np.random.randint(2, size=mask_shape).astype("bool")
# Invoke the same data, but with a null mask (where no elements are masked).
null_mask_data = np.ones(mask_shape)
# Because one data is masked and one is not, the outputs should not be the
# same.
query_tensor = keras.Input(query_shape[1:], name="query")
value_tensor = keras.Input(value_shape[1:], name="value")
mask_tensor = keras.Input(mask_shape[1:], name="mask")
output = test_layer(query=query_tensor, value=value_tensor,
attention_mask=mask_tensor)
model = keras.Model([query_tensor, value_tensor, mask_tensor], output)
self.assertNotAllClose(
model.predict([query, value, mask_data]),
model.predict([query, value, null_mask_data]))
def test_dropout(self):
test_layer = edgetpu_layers.EdgeTPUMultiHeadAttention(
num_heads=2, key_dim=2, dropout=0.5)
# Generate data for the input (non-mask) tensors.
from_data = keras.backend.ones(shape=(32, 4, 8))
to_data = keras.backend.ones(shape=(32, 2, 8))
train_out = test_layer(from_data, to_data, None, None, None, True)
test_out = test_layer(from_data, to_data, None, None, None, False)
# Output should be close when not in training mode,
# and should not be close when enabling dropout in training mode.
self.assertNotAllClose(
keras.backend.eval(train_out),
keras.backend.eval(test_out))
if __name__ == "__main__":
tf.test.main()
# Copyright 2021 The TensorFlow Authors. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
"""MobileBERT text encoder network."""
import tensorflow as tf
from official.nlp import modeling
from official.nlp.modeling import layers
from official.projects.edgetpu.nlp.modeling import edgetpu_layers
@tf.keras.utils.register_keras_serializable(package='Text')
class MobileBERTEncoder(tf.keras.Model):
"""A Keras functional API implementation for MobileBERT encoder."""
def __init__(self,
word_vocab_size=30522,
word_embed_size=128,
type_vocab_size=2,
max_sequence_length=512,
num_blocks=24,
hidden_size=512,
num_attention_heads=4,
intermediate_size=512,
intermediate_act_fn='relu',
hidden_dropout_prob=0.1,
attention_probs_dropout_prob=0.1,
intra_bottleneck_size=128,
initializer_range=0.02,
use_bottleneck_attention=False,
key_query_shared_bottleneck=True,
num_feedforward_networks=4,
normalization_type='no_norm',
classifier_activation=False,
input_mask_dtype='int32',
quantization_friendly=True,
**kwargs):
"""Class initialization.
Args:
word_vocab_size: Number of words in the vocabulary.
word_embed_size: Word embedding size.
type_vocab_size: Number of word types.
max_sequence_length: Maximum length of input sequence.
      num_blocks: Number of transformer blocks in the encoder model.
hidden_size: Hidden size for the transformer block.
num_attention_heads: Number of attention heads in the transformer block.
intermediate_size: The size of the "intermediate" (a.k.a., feed
forward) layer.
intermediate_act_fn: The non-linear activation function to apply
to the output of the intermediate/feed-forward layer.
hidden_dropout_prob: Dropout probability for the hidden layers.
attention_probs_dropout_prob: Dropout probability of the attention
probabilities.
intra_bottleneck_size: Size of bottleneck.
initializer_range: The stddev of the `truncated_normal_initializer` for
initializing all weight matrices.
use_bottleneck_attention: Use attention inputs from the bottleneck
transformation. If true, the following `key_query_shared_bottleneck`
will be ignored.
key_query_shared_bottleneck: Whether to share linear transformation for
keys and queries.
num_feedforward_networks: Number of stacked feed-forward networks.
normalization_type: The type of normalization_type, only `no_norm` and
`layer_norm` are supported. `no_norm` represents the element-wise linear
transformation for the student model, as suggested by the original
MobileBERT paper. `layer_norm` is used for the teacher model.
classifier_activation: If using the tanh activation for the final
representation of the `[CLS]` token in fine-tuning.
input_mask_dtype: The dtype of `input_mask` tensor, which is one of the
input tensors of this encoder. Defaults to `int32`. If you want
to use `tf.lite` quantization, which does not support `Cast` op,
please set this argument to `tf.float32` and feed `input_mask`
tensor with values in `float32` to avoid `tf.cast` in the computation.
      quantization_friendly: If enabled, the model uses the EdgeTPU mobile
        transformer. The difference is a customized softmax op which uses -120
        as the mask value, which is more stable for post-training quantization.
      **kwargs: Other keyword arguments.
"""
self._self_setattr_tracking = False
initializer = tf.keras.initializers.TruncatedNormal(
stddev=initializer_range)
# layer instantiation
self.embedding_layer = layers.MobileBertEmbedding(
word_vocab_size=word_vocab_size,
word_embed_size=word_embed_size,
type_vocab_size=type_vocab_size,
output_embed_size=hidden_size,
max_sequence_length=max_sequence_length,
normalization_type=normalization_type,
initializer=initializer,
dropout_rate=hidden_dropout_prob)
self._transformer_layers = []
transformer_layer_args = dict(
hidden_size=hidden_size,
num_attention_heads=num_attention_heads,
intermediate_size=intermediate_size,
intermediate_act_fn=intermediate_act_fn,
hidden_dropout_prob=hidden_dropout_prob,
attention_probs_dropout_prob=attention_probs_dropout_prob,
intra_bottleneck_size=intra_bottleneck_size,
use_bottleneck_attention=use_bottleneck_attention,
key_query_shared_bottleneck=key_query_shared_bottleneck,
num_feedforward_networks=num_feedforward_networks,
normalization_type=normalization_type,
initializer=initializer,
)
for layer_idx in range(num_blocks):
if quantization_friendly:
transformer = edgetpu_layers.EdgetpuMobileBertTransformer(
name=f'transformer_layer_{layer_idx}',
**transformer_layer_args)
else:
transformer = layers.MobileBertTransformer(
name=f'transformer_layer_{layer_idx}',
**transformer_layer_args)
self._transformer_layers.append(transformer)
# input tensor
input_ids = tf.keras.layers.Input(
shape=(None,), dtype=tf.int32, name='input_word_ids')
type_ids = tf.keras.layers.Input(
shape=(None,), dtype=tf.int32, name='input_type_ids')
input_mask = tf.keras.layers.Input(
shape=(None,), dtype=input_mask_dtype, name='input_mask')
self.inputs = [input_ids, input_mask, type_ids]
    # The dtype of `attention_mask` will be the same as the dtype of
    # `input_mask`.
attention_mask = modeling.layers.SelfAttentionMask()(input_mask, input_mask)
# build the computation graph
all_layer_outputs = []
all_attention_scores = []
embedding_output = self.embedding_layer(input_ids, type_ids)
all_layer_outputs.append(embedding_output)
prev_output = embedding_output
for layer_idx in range(num_blocks):
layer_output, attention_score = self._transformer_layers[layer_idx](
prev_output,
attention_mask,
return_attention_scores=True)
all_layer_outputs.append(layer_output)
all_attention_scores.append(attention_score)
prev_output = layer_output
first_token = tf.squeeze(prev_output[:, 0:1, :], axis=1)
if classifier_activation:
self._pooler_layer = tf.keras.layers.experimental.EinsumDense(
'ab,bc->ac',
output_shape=hidden_size,
activation=tf.tanh,
bias_axes='c',
kernel_initializer=initializer,
name='pooler')
first_token = self._pooler_layer(first_token)
else:
self._pooler_layer = None
outputs = dict(
sequence_output=prev_output,
pooled_output=first_token,
encoder_outputs=all_layer_outputs,
attention_scores=all_attention_scores)
super(MobileBERTEncoder, self).__init__(
inputs=self.inputs, outputs=outputs, **kwargs)
def get_embedding_table(self):
return self.embedding_layer.word_embedding.embeddings
def get_embedding_layer(self):
return self.embedding_layer.word_embedding
@property
def transformer_layers(self):
"""List of Transformer layers in the encoder."""
return self._transformer_layers
@property
def pooler_layer(self):
"""The pooler dense layer after the transformer layers."""
return self._pooler_layer
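# Example usage (illustrative sketch, not part of the library): build a small
# encoder and run it on dummy inputs. The shapes and hyperparameters below are
# arbitrary and chosen only to keep the example fast.
#
#   import tensorflow as tf
#
#   encoder = MobileBERTEncoder(num_blocks=2, quantization_friendly=True)
#   seq_len = 16
#   dummy_ids = tf.zeros((1, seq_len), dtype=tf.int32)
#   dummy_mask = tf.ones((1, seq_len), dtype=tf.int32)
#   outputs = encoder([dummy_ids, dummy_mask, dummy_ids])
#   print(outputs['sequence_output'].shape)  # (1, 16, 512)
#   print(outputs['pooled_output'].shape)    # (1, 512)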