"vscode:/vscode.git/clone" did not exist on "40cb07b4b9d66682b787b7825d22f01df80ce022"
Unverified Commit 44f6d511 authored by Srihari Humbarwadi, committed by GitHub

Merge branch 'tensorflow:master' into panoptic-deeplab

parents 686a287d 8bc5a1a5
......@@ -3,7 +3,8 @@
</div>
[![Python](https://img.shields.io/pypi/pyversions/tensorflow.svg?style=plastic)](https://badge.fury.io/py/tensorflow)
[![PyPI](https://badge.fury.io/py/tensorflow.svg)](https://badge.fury.io/py/tensorflow)
[![tf-models-official PyPI](https://badge.fury.io/py/tf-models-official.svg)](https://badge.fury.io/py/tf-models-official)
# Welcome to the Model Garden for TensorFlow
......@@ -32,7 +33,8 @@ To install the current release of tensorflow-models, please follow any one of th
<details>
**tf-models-official** is the stable Model Garden package.
**tf-models-official** is the stable Model Garden package. Please check out the [releases](https://github.com/tensorflow/models/releases) to see what modules are available.
pip will install all models and dependencies automatically.
```shell
......
......@@ -19,7 +19,7 @@ This repository provides a curated list of the GitHub repositories with machine
| [ResNet 101](https://github.com/IntelAI/models/tree/master/benchmarks/image_recognition/tensorflow/resnet101) | [Deep Residual Learning for Image Recognition](https://arxiv.org/pdf/1512.03385) | • Int8 Inference<br/>• FP32 Inference | [Intel](https://github.com/IntelAI) |
| [ResNet 50](https://github.com/IntelAI/models/tree/master/benchmarks/image_recognition/tensorflow/resnet50) | [Deep Residual Learning for Image Recognition](https://arxiv.org/pdf/1512.03385) | • Int8 Inference<br/>• FP32 Inference | [Intel](https://github.com/IntelAI) |
| [ResNet 50v1.5](https://github.com/IntelAI/models/tree/master/benchmarks/image_recognition/tensorflow/resnet50v1_5) | [Deep Residual Learning for Image Recognition](https://arxiv.org/pdf/1512.03385) | • Int8 Inference<br/>• FP32 Inference<br/>• FP32 Training | [Intel](https://github.com/IntelAI) |
| [EfficientNet](https://github.com/NVIDIA/DeepLearningExamples/tree/master/TensorFlow2/Classification/ConvNets/efficientnet) | [EfficientNet: Rethinking Model Scaling for Convolutional Neural Networks](https://arxiv.org/pdf/1905.11946.pdf) | • Automatic mixed precision<br/>• Horovod Multi-GPU training (NCCL)<br/>• Multi-node training on a Pyxis/Enroot Slurm cluster<br/>• XLA | [NVIDIA](https://github.com/NVIDIA) |
| EfficientNet [v1](https://github.com/NVIDIA/DeepLearningExamples/tree/master/TensorFlow2/Classification/ConvNets/efficientnet_v1) [v2](https://github.com/NVIDIA/DeepLearningExamples/tree/master/TensorFlow2/Classification/ConvNets/efficientnet_v2) | [EfficientNet: Rethinking Model Scaling for Convolutional Neural Networks](https://arxiv.org/pdf/1905.11946.pdf) | • Automatic mixed precision<br/>• Horovod Multi-GPU training (NCCL)<br/>• Multi-node training on a Pyxis/Enroot Slurm cluster<br/>• XLA | [NVIDIA](https://github.com/NVIDIA) |
### Object Detection
......
......@@ -38,16 +38,15 @@ In the near future, we will add:
## Models and Implementations
### Computer Vision
### [Computer Vision](vision/README.md)
#### Image Classification
| Model | Reference (Paper) |
|-------|-------------------|
| [MNIST](legacy/image_classification) | A basic model to classify digits from the [MNIST dataset](http://yann.lecun.com/exdb/mnist/) |
| [ResNet](vision/MODEL_GARDEN.md) | [Deep Residual Learning for Image Recognition](https://arxiv.org/abs/1512.03385) |
| [ResNet-RS](vision/MODEL_GARDEN.md) | [Revisiting ResNets: Improved Training and Scaling Strategies](https://arxiv.org/abs/2103.07579) |
| [EfficientNet](legacy/image_classification) | [EfficientNet: Rethinking Model Scaling for Convolutional Neural Networks](https://arxiv.org/abs/1905.11946) |
| [EfficientNet](vision/MODEL_GARDEN.md) | [EfficientNet: Rethinking Model Scaling for Convolutional Neural Networks](https://arxiv.org/abs/1905.11946) |
| [Vision Transformer](vision/MODEL_GARDEN.md) | [An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale](https://arxiv.org/abs/2010.11929) |
#### Object Detection and Segmentation
......@@ -56,7 +55,6 @@ In the near future, we will add:
|-------|-------------------|
| [RetinaNet](vision/MODEL_GARDEN.md) | [Focal Loss for Dense Object Detection](https://arxiv.org/abs/1708.02002) |
| [Mask R-CNN](vision/MODEL_GARDEN.md) | [Mask R-CNN](https://arxiv.org/abs/1703.06870) |
| [ShapeMask](legacy/detection) | [ShapeMask: Learning to Segment Novel Objects by Refining Shape Priors](https://arxiv.org/abs/1904.03239) |
| [SpineNet](vision/MODEL_GARDEN.md) | [SpineNet: Learning Scale-Permuted Backbone for Recognition and Localization](https://arxiv.org/abs/1912.05027) |
| [Cascade RCNN-RS and RetinaNet-RS](vision/MODEL_GARDEN.md) | [Simple Training Strategies and Model Scaling for Object Detection](https://arxiv.org/abs/2107.00057)|
......@@ -66,7 +64,7 @@ In the near future, we will add:
|-------|-------------------|
| [Mobile Video Networks (MoViNets)](projects/movinet) | [MoViNets: Mobile Video Networks for Efficient Video Recognition](https://arxiv.org/abs/2103.11511) |
### Natural Language Processing
### [Natural Language Processing](nlp/README.md)
| Model | Reference (Paper) |
|-------|-------------------|
......@@ -74,7 +72,6 @@ In the near future, we will add:
| [BERT (Bidirectional Encoder Representations from Transformers)](nlp/MODEL_GARDEN.md#available-model-configs) | [BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding](https://arxiv.org/abs/1810.04805) |
| [NHNet (News Headline generation model)](projects/nhnet) | [Generating Representative Headlines for News Stories](https://arxiv.org/abs/2001.09386) |
| [Transformer](nlp/MODEL_GARDEN.md#available-model-configs) | [Attention Is All You Need](https://arxiv.org/abs/1706.03762) |
| [XLNet](nlp/xlnet) | [XLNet: Generalized Autoregressive Pretraining for Language Understanding](https://arxiv.org/abs/1906.08237) |
| [MobileBERT](projects/mobilebert) | [MobileBERT: a Compact Task-Agnostic BERT for Resource-Limited Devices](https://arxiv.org/abs/2004.02984) |
### Recommendation
......
......@@ -34,14 +34,10 @@
{
"cell_type": "markdown",
"metadata": {
"id": "fsACVQpVSifi"
"id": "2X-XaMSVcLua"
},
"source": [
"### Install the TensorFlow Model Garden pip package\n",
"\n",
"* `tf-models-official` is the stable Model Garden package. Note that it may not include the latest changes in the `tensorflow_models` github repo. To include latest changes, you may install `tf-models-nightly`,\n",
"which is the nightly Model Garden package created daily automatically.\n",
"* pip will install all models and dependencies automatically."
"# Decoding API"
]
},
{
......@@ -66,6 +62,30 @@
"\u003c/table\u003e"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "fsACVQpVSifi"
},
"source": [
"### Install the TensorFlow Model Garden pip package\n",
"\n",
"* `tf-models-official` is the stable Model Garden package. Note that it may not include the latest changes in the `tensorflow_models` github repo. To include latest changes, you may install `tf-models-nightly`,\n",
"which is the nightly Model Garden package created daily automatically.\n",
"* pip will install all models and dependencies automatically."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"id": "G4BhAu01HZcM"
},
"outputs": [],
"source": [
"!pip uninstall -y opencv-python"
]
},
{
"cell_type": "code",
"execution_count": null,
......@@ -74,7 +94,7 @@
},
"outputs": [],
"source": [
"pip install tf-models-nightly"
"!pip install tf-models-official"
]
},
{
......@@ -92,9 +112,20 @@
"\n",
"import tensorflow as tf\n",
"\n",
"from official import nlp\n",
"from official.nlp.modeling.ops import sampling_module\n",
"from official.nlp.modeling.ops import beam_search"
"from tensorflow_models import nlp"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"id": "T92ccAzlnGqh"
},
"outputs": [],
"source": [
"def length_norm(length, dtype):\n",
" \"\"\"Return length normalization factor.\"\"\"\n",
" return tf.pow(((5. + tf.cast(length, dtype)) / 6.), 0.0)"
]
},
{
......@@ -103,7 +134,8 @@
"id": "0AWgyo-IQ5sP"
},
"source": [
"# Decoding API\n",
"## Overview\n",
"\n",
"This API provides an interface to experiment with different decoding strategies used for auto-regressive models.\n",
"\n",
"1. The following sampling strategies are provided in sampling_module.py, which inherits from the base Decoding class:\n",
......@@ -182,7 +214,7 @@
"id": "lV1RRp6ihnGX"
},
"source": [
"# Initialize the Model Hyper-parameters"
"## Initialize the Model Hyper-parameters"
]
},
{
......@@ -193,44 +225,32 @@
},
"outputs": [],
"source": [
"params = {}\n",
"params['num_heads'] = 2\n",
"params['num_layers'] = 2\n",
"params['batch_size'] = 2\n",
"params['n_dims'] = 256\n",
"params['max_decode_length'] = 4"
"params = {\n",
" 'num_heads': 2\n",
" 'num_layers': 2\n",
" 'batch_size': 2\n",
" 'n_dims': 256\n",
" 'max_decode_length': 4}"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "UGvmd0_dRFYI"
"id": "CYXkoplAij01"
},
"source": [
"## What is a Cache?\n",
"In auto-regressive architectures like Transformer based [Encoder-Decoder](https://arxiv.org/abs/1706.03762) models, \n",
"Cache is used for fast sequential decoding.\n",
"It is a nested dictionary storing pre-computed hidden-states (key and values in the self-attention blocks and in the cross-attention blocks) for every layer.\n",
"\n",
"```\n",
"{\n",
" 'layer_%d' % layer: {\n",
" 'k': tf.zeros([params['batch_size'], params['max_decode_length'], params['num_heads'], params['n_dims']/params['num_heads']], dtype=tf.float32),\n",
" 'v': tf.zeros([params['batch_size'], params['max_decode_length'], params['num_heads'], params['n_dims']/params['num_heads']], dtype=tf.float32)\n",
" } for layer in range(params['num_layers']),\n",
" 'model_specific_item' : Model specific tensor shape,\n",
"}\n",
"\n",
"```"
"## Initialize cache. "
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "CYXkoplAij01"
"id": "UGvmd0_dRFYI"
},
"source": [
"# Initialize cache. "
"In auto-regressive architectures like Transformer based [Encoder-Decoder](https://arxiv.org/abs/1706.03762) models, \n",
"Cache is used for fast sequential decoding.\n",
"It is a nested dictionary storing pre-computed hidden-states (key and values in the self-attention blocks and in the cross-attention blocks) for every layer."
]
},
{
......@@ -243,35 +263,15 @@
"source": [
"cache = {\n",
" 'layer_%d' % layer: {\n",
" 'k': tf.zeros([params['batch_size'], params['max_decode_length'], params['num_heads'], params['n_dims']/params['num_heads']], dtype=tf.float32),\n",
" 'v': tf.zeros([params['batch_size'], params['max_decode_length'], params['num_heads'], params['n_dims']/params['num_heads']], dtype=tf.float32)\n",
" 'k': tf.zeros(\n",
" shape=[params['batch_size'], params['max_decode_length'], params['num_heads'], params['n_dims'] // params['num_heads']],\n",
" dtype=tf.float32),\n",
" 'v': tf.zeros(\n",
" shape=[params['batch_size'], params['max_decode_length'], params['num_heads'], params['n_dims'] // params['num_heads']],\n",
" dtype=tf.float32)\n",
" } for layer in range(params['num_layers'])\n",
" }\n",
"print(\"cache key shape for layer 1 :\", cache['layer_1']['k'].shape)"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "nNY3Xn8SiblP"
},
"source": [
"# Define closure for length normalization. **optional.**\n",
"\n",
"\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"id": "T92ccAzlnGqh"
},
"outputs": [],
"source": [
"def length_norm(length, dtype):\n",
" \"\"\"Return length normalization factor.\"\"\"\n",
" return tf.pow(((5. + tf.cast(length, dtype)) / 6.), 0.0)"
"print(\"cache value shape for layer 1 :\", cache['layer_1']['k'].shape)"
]
},
{
......@@ -280,15 +280,14 @@
"id": "syl7I5nURPgW"
},
"source": [
"# Create model_fn\n",
"### Create model_fn\n",
" In practice, this will be replaced by an actual model implementation such as [here](https://github.com/tensorflow/models/blob/master/official/nlp/transformer/transformer.py#L236)\n",
"```\n",
"Args:\n",
"i : Step that is being decoded.\n",
"Returns:\n",
" logit probabilities of size [batch_size, 1, vocab_size]\n",
"```\n",
"\n"
"```\n"
]
},
{
......@@ -307,15 +306,6 @@
" return probabilities[:, i, :]"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "DBMUkaVmVZBg"
},
"source": [
"# Initialize symbols_to_logits_fn\n"
]
},
{
"cell_type": "code",
"execution_count": null,
......@@ -339,7 +329,7 @@
"id": "R_tV3jyWVL47"
},
"source": [
"# Greedy \n",
"## Greedy \n",
"Greedy decoding selects the token id with the highest probability as its next id: $id_t = argmax_{w}P(id | id_{1:t-1})$ at each timestep $t$. The following sketch shows greedy decoding. "
]
},
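The selection rule itself can be sketched outside the notebook's library calls. The snippet below is a minimal illustration in plain TensorFlow (it is not the `sampling_module` API, and the logits values are made up): at every decoding step the id with the highest logit is chosen.

```python
import tensorflow as tf

# Greedy rule: at each step, take the argmax over the next-token logits.
logits = tf.constant([[0.1, 2.3, -1.0],   # batch element 0
                      [1.5, 0.2, 0.9]])   # batch element 1
next_ids = tf.argmax(logits, axis=-1)     # -> [1, 0]
print(next_ids.numpy())
```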
......@@ -370,7 +360,7 @@
"id": "s4pTTsQXVz5O"
},
"source": [
"# top_k sampling\n",
"## top_k sampling\n",
"In *Top-K* sampling, the *K* most likely next token ids are filtered and the probability mass is redistributed among only those *K* ids. "
]
},
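A rough, self-contained sketch of that filtering step in plain TensorFlow (not the `sampling_module` implementation; the example logits and `k` are made up): logits outside the top *K* are masked out before sampling.

```python
import tensorflow as tf

def top_k_filter(logits, k):
  """Keeps the k largest logits per row and masks the rest before sampling."""
  top_values, _ = tf.math.top_k(logits, k=k)
  kth_value = top_values[..., -1, tf.newaxis]
  return tf.where(logits < kth_value,
                  tf.fill(tf.shape(logits), logits.dtype.min),
                  logits)

logits = tf.constant([[1.0, 3.0, 0.5, 2.0]])
filtered = top_k_filter(logits, k=2)
# Sampling now only draws ids 1 or 3, the two most likely tokens.
next_id = tf.random.categorical(filtered, num_samples=1)
```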
......@@ -404,7 +394,7 @@
"id": "Jp3G-eE_WI4Y"
},
"source": [
"# top_p sampling\n",
"## top_p sampling\n",
"Instead of sampling only from the most likely *K* token ids, in *Top-p* sampling chooses from the smallest possible set of ids whose cumulative probability exceeds the probability *p*."
]
},
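A similar self-contained sketch of the nucleus filtering in plain TensorFlow (again not the `sampling_module` implementation; the example logits and `p` are made up): ids are kept while the probability mass of the better-ranked ids is still below *p*, and everything else is masked.

```python
import tensorflow as tf

def top_p_filter(logits, p):
  """Masks every id outside the smallest set whose cumulative probability exceeds p."""
  sorted_logits = tf.sort(logits, direction='DESCENDING', axis=-1)
  probs = tf.nn.softmax(sorted_logits, axis=-1)
  cumulative = tf.cumsum(probs, axis=-1)
  # An id stays in the nucleus while the mass of strictly better-ranked ids is below p.
  keep = (cumulative - probs) < p
  threshold = tf.reduce_min(
      tf.where(keep, sorted_logits,
               tf.fill(tf.shape(sorted_logits), sorted_logits.dtype.max)),
      axis=-1, keepdims=True)
  return tf.where(logits < threshold,
                  tf.fill(tf.shape(logits), logits.dtype.min),
                  logits)

logits = tf.constant([[2.0, 1.0, 0.1, -1.0]])
filtered = top_p_filter(logits, p=0.9)
next_id = tf.random.categorical(filtered, num_samples=1)
```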
......@@ -438,7 +428,7 @@
"id": "2hcuyJ2VWjDz"
},
"source": [
"# Beam search decoding\n",
"## Beam search decoding\n",
"Beam search reduces the risk of missing hidden high probability token ids by keeping the most likely num_beams of hypotheses at each time step and eventually choosing the hypothesis that has the overall highest probability. "
]
},
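The `beam_search` module does this over real model logits; the toy sketch below only illustrates the bookkeeping with hypothetical per-step probabilities (it is not the Model Garden API): expand every hypothesis with every vocabulary id, score it by cumulative log probability, and keep the best `num_beams`.

```python
import numpy as np

# Toy beam-search bookkeeping over made-up step probabilities.
step_log_probs = np.log(np.array([
    [0.6, 0.3, 0.1],   # step 0: P(id | prefix) for a 3-id vocabulary
    [0.2, 0.5, 0.3],   # step 1: assumed independent of the prefix, for simplicity
]))
num_beams = 2
beams = [((), 0.0)]    # (ids so far, cumulative log probability)
for log_probs in step_log_probs:
  candidates = [(ids + (tok,), score + lp)
                for ids, score in beams
                for tok, lp in enumerate(log_probs)]
  beams = sorted(candidates, key=lambda c: c[1], reverse=True)[:num_beams]
print(beams)  # the two hypotheses with the highest overall log probability
```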
......
......@@ -95,6 +95,19 @@
"* `pip` will install all models and dependencies automatically."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"id": "IAOmYthAzI7J"
},
"outputs": [],
"source": [
"# Uninstall colab's opencv-python, it conflicts with `opencv-python-headless`\n",
"# which is installed by tf-models-official\n",
"!pip uninstall -y opencv-python"
]
},
{
"cell_type": "code",
"execution_count": null,
......@@ -103,7 +116,7 @@
},
"outputs": [],
"source": [
"!pip install -q tf-models-official==2.4.0"
"!pip install tf-models-official"
]
},
{
......@@ -126,8 +139,7 @@
"import numpy as np\n",
"import tensorflow as tf\n",
"\n",
"from official.nlp import modeling\n",
"from official.nlp.modeling import layers, losses, models, networks"
"from tensorflow_models import nlp"
]
},
{
......@@ -151,9 +163,9 @@
"source": [
"### Build a `BertPretrainer` model wrapping `BertEncoder`\n",
"\n",
"The [BertEncoder](https://github.com/tensorflow/models/blob/master/official/nlp/modeling/networks/bert_encoder.py) implements the Transformer-based encoder as described in [BERT paper](https://arxiv.org/abs/1810.04805). It includes the embedding lookups and transformer layers, but not the masked language model or classification task networks.\n",
"The `nlp.networks.BertEncoder` class implements the Transformer-based encoder as described in [BERT paper](https://arxiv.org/abs/1810.04805). It includes the embedding lookups and transformer layers (`nlp.layers.TransformerEncoderBlock`), but not the masked language model or classification task networks.\n",
"\n",
"The [BertPretrainer](https://github.com/tensorflow/models/blob/master/official/nlp/modeling/models/bert_pretrainer.py) allows a user to pass in a transformer stack, and instantiates the masked language model and classification networks that are used to create the training objectives."
"The `nlp.models.BertPretrainer` class allows a user to pass in a transformer stack, and instantiates the masked language model and classification networks that are used to create the training objectives."
]
},
{
......@@ -166,9 +178,10 @@
"source": [
"# Build a small transformer network.\n",
"vocab_size = 100\n",
"sequence_length = 16\n",
"network = modeling.networks.BertEncoder(\n",
" vocab_size=vocab_size, num_layers=2, sequence_length=16)"
"network = nlp.networks.BertEncoder(\n",
" vocab_size=vocab_size, \n",
" # The number of TransformerEncoderBlock layers\n",
" num_layers=3)"
]
},
{
......@@ -177,7 +190,7 @@
"id": "0NH5irV5KTMS"
},
"source": [
"Inspecting the encoder, we see it contains few embedding layers, stacked `Transformer` layers and are connected to three input layers:\n",
"Inspecting the encoder, we see it contains few embedding layers, stacked `nlp.layers.TransformerEncoderBlock` layers and are connected to three input layers:\n",
"\n",
"`input_word_ids`, `input_type_ids` and `input_mask`.\n"
]
......@@ -190,7 +203,7 @@
},
"outputs": [],
"source": [
"tf.keras.utils.plot_model(network, show_shapes=True, dpi=48)"
"tf.keras.utils.plot_model(network, show_shapes=True, expand_nested=True, dpi=48)"
]
},
{
......@@ -203,7 +216,7 @@
"source": [
"# Create a BERT pretrainer with the created network.\n",
"num_token_predictions = 8\n",
"bert_pretrainer = modeling.models.BertPretrainer(\n",
"bert_pretrainer = nlp.models.BertPretrainer(\n",
" network, num_classes=2, num_token_predictions=num_token_predictions, output='predictions')"
]
},
......@@ -213,7 +226,7 @@
"id": "d5h5HT7gNHx_"
},
"source": [
"Inspecting the `bert_pretrainer`, we see it wraps the `encoder` with additional `MaskedLM` and `Classification` heads."
"Inspecting the `bert_pretrainer`, we see it wraps the `encoder` with additional `MaskedLM` and `nlp.layers.ClassificationHead` heads."
]
},
{
......@@ -224,7 +237,7 @@
},
"outputs": [],
"source": [
"tf.keras.utils.plot_model(bert_pretrainer, show_shapes=True, dpi=48)"
"tf.keras.utils.plot_model(bert_pretrainer, show_shapes=True, expand_nested=True, dpi=48)"
]
},
{
......@@ -236,7 +249,9 @@
"outputs": [],
"source": [
"# We can feed some dummy data to get masked language model and sentence output.\n",
"sequence_length = 16\n",
"batch_size = 2\n",
"\n",
"word_id_data = np.random.randint(vocab_size, size=(batch_size, sequence_length))\n",
"mask_data = np.random.randint(2, size=(batch_size, sequence_length))\n",
"type_id_data = np.random.randint(2, size=(batch_size, sequence_length))\n",
......@@ -246,8 +261,8 @@
" [word_id_data, mask_data, type_id_data, masked_lm_positions_data])\n",
"lm_output = outputs[\"masked_lm\"]\n",
"sentence_output = outputs[\"classification\"]\n",
"print(lm_output)\n",
"print(sentence_output)"
"print(f'lm_output: shape={lm_output.shape}, dtype={lm_output.dtype!r}')\n",
"print(f'sentence_output: shape={sentence_output.shape}, dtype={sentence_output.dtype!r}')"
]
},
{
......@@ -272,14 +287,15 @@
"masked_lm_weights_data = np.random.randint(2, size=(batch_size, num_token_predictions))\n",
"next_sentence_labels_data = np.random.randint(2, size=(batch_size))\n",
"\n",
"mlm_loss = modeling.losses.weighted_sparse_categorical_crossentropy_loss(\n",
"mlm_loss = nlp.losses.weighted_sparse_categorical_crossentropy_loss(\n",
" labels=masked_lm_ids_data,\n",
" predictions=lm_output,\n",
" weights=masked_lm_weights_data)\n",
"sentence_loss = modeling.losses.weighted_sparse_categorical_crossentropy_loss(\n",
"sentence_loss = nlp.losses.weighted_sparse_categorical_crossentropy_loss(\n",
" labels=next_sentence_labels_data,\n",
" predictions=sentence_output)\n",
"loss = mlm_loss + sentence_loss\n",
"\n",
"print(loss)"
]
},
......@@ -290,8 +306,7 @@
},
"source": [
"With the loss, you can optimize the model.\n",
"After training, we can save the weights of TransformerEncoder for the downstream fine-tuning tasks. Please see [run_pretraining.py](https://github.com/tensorflow/models/blob/master/official/nlp/bert/run_pretraining.py) for the full example.\n",
"\n"
"After training, we can save the weights of TransformerEncoder for the downstream fine-tuning tasks. Please see [run_pretraining.py](https://github.com/tensorflow/models/blob/master/official/nlp/bert/run_pretraining.py) for the full example.\n"
]
},
{
......@@ -315,9 +330,9 @@
"source": [
"### Build a BertSpanLabeler wrapping BertEncoder\n",
"\n",
"[BertSpanLabeler](https://github.com/tensorflow/models/blob/master/official/nlp/modeling/models/bert_span_labeler.py) implements a simple single-span start-end predictor (that is, a model that predicts two values: a start token index and an end token index), suitable for SQuAD-style tasks.\n",
"The `nlp.models.BertSpanLabeler` class implements a simple single-span start-end predictor (that is, a model that predicts two values: a start token index and an end token index), suitable for SQuAD-style tasks.\n",
"\n",
"Note that `BertSpanLabeler` wraps a `BertEncoder`, the weights of which can be restored from the above pretraining model.\n"
"Note that `nlp.models.BertSpanLabeler` wraps a `nlp.networks.BertEncoder`, the weights of which can be restored from the above pretraining model.\n"
]
},
{
......@@ -328,11 +343,11 @@
},
"outputs": [],
"source": [
"network = modeling.networks.BertEncoder(\n",
" vocab_size=vocab_size, num_layers=2, sequence_length=sequence_length)\n",
"network = nlp.networks.BertEncoder(\n",
" vocab_size=vocab_size, num_layers=2)\n",
"\n",
"# Create a BERT trainer with the created network.\n",
"bert_span_labeler = modeling.models.BertSpanLabeler(network)"
"bert_span_labeler = nlp.models.BertSpanLabeler(network)"
]
},
{
......@@ -352,7 +367,7 @@
},
"outputs": [],
"source": [
"tf.keras.utils.plot_model(bert_span_labeler, show_shapes=True, dpi=48)"
"tf.keras.utils.plot_model(bert_span_labeler, show_shapes=True, expand_nested=True, dpi=48)"
]
},
{
......@@ -370,8 +385,9 @@
"\n",
"# Feed the data to the model.\n",
"start_logits, end_logits = bert_span_labeler([word_id_data, mask_data, type_id_data])\n",
"print(start_logits)\n",
"print(end_logits)"
"\n",
"print(f'start_logits: shape={start_logits.shape}, dtype={start_logits.dtype!r}')\n",
"print(f'end_logits: shape={end_logits.shape}, dtype={end_logits.dtype!r}')"
]
},
{
......@@ -432,7 +448,7 @@
"source": [
"### Build a BertClassifier model wrapping BertEncoder\n",
"\n",
"[BertClassifier](https://github.com/tensorflow/models/blob/master/official/nlp/modeling/models/bert_classifier.py) implements a [CLS] token classification model containing a single classification head."
"`nlp.models.BertClassifier` implements a [CLS] token classification model containing a single classification head."
]
},
{
......@@ -443,12 +459,12 @@
},
"outputs": [],
"source": [
"network = modeling.networks.BertEncoder(\n",
" vocab_size=vocab_size, num_layers=2, sequence_length=sequence_length)\n",
"network = nlp.networks.BertEncoder(\n",
" vocab_size=vocab_size, num_layers=2)\n",
"\n",
"# Create a BERT trainer with the created network.\n",
"num_classes = 2\n",
"bert_classifier = modeling.models.BertClassifier(\n",
"bert_classifier = nlp.models.BertClassifier(\n",
" network, num_classes=num_classes)"
]
},
......@@ -469,7 +485,7 @@
},
"outputs": [],
"source": [
"tf.keras.utils.plot_model(bert_classifier, show_shapes=True, dpi=48)"
"tf.keras.utils.plot_model(bert_classifier, show_shapes=True, expand_nested=True, dpi=48)"
]
},
{
......@@ -487,7 +503,7 @@
"\n",
"# Feed the data to the model.\n",
"logits = bert_classifier([word_id_data, mask_data, type_id_data])\n",
"print(logits)"
"print(f'logits: shape={logits.shape}, dtype={logits.dtype!r}')"
]
},
{
......@@ -529,8 +545,7 @@
"metadata": {
"colab": {
"collapsed_sections": [],
"name": "Introduction to the TensorFlow Models NLP library",
"private_outputs": true,
"name": "nlp_modeling_library_intro.ipynb",
"provenance": [],
"toc_visible": true
},
......
......@@ -12,3 +12,15 @@
# See the License for the specific language governing permissions and
# limitations under the License.
"""Core is shared by both `nlp` and `vision`."""
from official.core import actions
from official.core import base_task
from official.core import base_trainer
from official.core import config_definitions
from official.core import exp_factory
from official.core import export_base
from official.core import input_reader
from official.core import registry
from official.core import task_factory
from official.core import train_lib
from official.core import train_utils
......@@ -33,57 +33,6 @@ ExperimentConfig = config_definitions.ExperimentConfig
TrainerConfig = config_definitions.TrainerConfig
class Recovery:
"""Built-in model blowup recovery module.
Checks the loss value by the given threshold. If applicable, recover the
model by reading the checkpoint on disk.
"""
def __init__(self,
loss_upper_bound: float,
checkpoint_manager: tf.train.CheckpointManager,
recovery_begin_steps: int = 0,
recovery_max_trials: int = 3):
self.recover_counter = 0
self.recovery_begin_steps = recovery_begin_steps
self.recovery_max_trials = recovery_max_trials
self.loss_upper_bound = loss_upper_bound
self.checkpoint_manager = checkpoint_manager
def should_recover(self, loss_value, global_step):
if tf.math.is_nan(loss_value):
return True
if (global_step >= self.recovery_begin_steps and
loss_value > self.loss_upper_bound):
return True
return False
def maybe_recover(self, loss_value, global_step):
"""Conditionally recovers the training by triggering checkpoint restoration.
Args:
loss_value: the loss value as a float.
global_step: the number of global training steps.
Raises:
RuntimeError: when recovery happens more than the max number of trials,
the job should crash.
"""
if not self.should_recover(loss_value, global_step):
return
self.recover_counter += 1
if self.recover_counter > self.recovery_max_trials:
raise RuntimeError(
"The loss value is NaN or out of range after training loop and "
f"this happens {self.recover_counter} times.")
# Loads the previous good checkpoint.
checkpoint_path = self.checkpoint_manager.restore_or_initialize()
logging.warning(
"Recovering the model from checkpoint: %s. The loss value becomes "
"%f at step %d.", checkpoint_path, loss_value, global_step)
class _AsyncTrainer(orbit.StandardTrainer, orbit.StandardEvaluator):
"""Trainer class for both sync and async Strategy."""
......
......@@ -150,30 +150,6 @@ class MockAsyncTrainer(trainer_lib._AsyncTrainer):
return self.eval_global_step.numpy()
class RecoveryTest(tf.test.TestCase):
def test_recovery_module(self):
ckpt = tf.train.Checkpoint(v=tf.Variable(1, dtype=tf.int32))
model_dir = self.get_temp_dir()
manager = tf.train.CheckpointManager(ckpt, model_dir, max_to_keep=1)
recovery_module = trainer_lib.Recovery(
loss_upper_bound=1.0,
checkpoint_manager=manager,
recovery_begin_steps=1,
recovery_max_trials=1)
self.assertFalse(recovery_module.should_recover(1.1, 0))
self.assertFalse(recovery_module.should_recover(0.1, 1))
self.assertTrue(recovery_module.should_recover(1.1, 2))
# First triggers the recovery once.
recovery_module.maybe_recover(1.1, 10)
# Second time, it raises.
with self.assertRaisesRegex(
RuntimeError, 'The loss value is NaN .*'):
recovery_module.maybe_recover(1.1, 10)
class TrainerTest(tf.test.TestCase, parameterized.TestCase):
def setUp(self):
......
......@@ -76,6 +76,10 @@ class DataConfig(base_config.Config):
features. The main use case is to skip the image/video decoding for better
performance.
seed: An optional seed to use for deterministic shuffling/preprocessing.
prefetch_buffer_size: An int specifying the buffer size of prefetch
datasets. If None, the buffer size is autotuned. Specifying this is useful
in case autotuning uses up too much memory by making the buffer size too
high.
"""
input_path: Union[Sequence[str], str, base_config.Config] = ""
tfds_name: str = ""
......@@ -96,6 +100,7 @@ class DataConfig(base_config.Config):
tfds_as_supervised: bool = False
tfds_skip_decoding_feature: str = ""
seed: Optional[int] = None
prefetch_buffer_size: Optional[int] = None
@dataclasses.dataclass
......@@ -190,8 +195,8 @@ class TrainerConfig(base_config.Config):
is only used in continuous_train_and_eval and continuous_eval modes. Default
value is 1 hour.
train_steps: number of train steps.
validation_steps: number of eval steps. If `None`, the entire eval dataset
is used.
validation_steps: number of eval steps. If -1, the entire eval dataset is
used.
validation_interval: number of training steps to run between evaluations.
best_checkpoint_export_subdir: if set, the trainer will keep track of the
best evaluation metric, and export the corresponding best checkpoint under
......
......@@ -292,6 +292,8 @@ class InputReader:
self._transform_and_batch_fn = transform_and_batch_fn
self._postprocess_fn = postprocess_fn
self._seed = params.seed
self._prefetch_buffer_size = (params.prefetch_buffer_size or
tf.data.experimental.AUTOTUNE)
# When tf.data service is enabled, each data service worker should get
# different random seeds. Thus, we set `seed` to None.
......@@ -505,4 +507,4 @@ class InputReader:
options = tf.data.Options()
options.experimental_deterministic = self._deterministic
dataset = dataset.with_options(options)
return dataset.prefetch(tf.data.experimental.AUTOTUNE)
return dataset.prefetch(self._prefetch_buffer_size)
# Copyright 2022 The TensorFlow Authors. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
"""Adam optimizer with weight decay that exactly matches the original BERT."""
import re
from absl import logging
import tensorflow as tf
class AdamWeightDecay(tf.keras.optimizers.Adam):
"""Adam enables L2 weight decay and clip_by_global_norm on gradients.
[Warning!]: Keras optimizer supports gradient clipping and has an AdamW
implementation. Please consider evaluating the choice in Keras package.
Just adding the square of the weights to the loss function is *not* the
correct way of using L2 regularization/weight decay with Adam, since that will
interact with the m and v parameters in strange ways.
Instead we want to decay the weights in a manner that doesn't interact with
the m/v parameters. This is equivalent to adding the square of the weights to
the loss with plain (non-momentum) SGD.
"""
def __init__(self,
learning_rate=0.001,
beta_1=0.9,
beta_2=0.999,
epsilon=1e-7,
amsgrad=False,
weight_decay_rate=0.0,
include_in_weight_decay=None,
exclude_from_weight_decay=None,
gradient_clip_norm=1.0,
name='AdamWeightDecay',
**kwargs):
super(AdamWeightDecay, self).__init__(learning_rate, beta_1, beta_2,
epsilon, amsgrad, name, **kwargs)
self.weight_decay_rate = weight_decay_rate
self.gradient_clip_norm = gradient_clip_norm
self._include_in_weight_decay = include_in_weight_decay
self._exclude_from_weight_decay = exclude_from_weight_decay
logging.info('AdamWeightDecay gradient_clip_norm=%f', gradient_clip_norm)
def _prepare_local(self, var_device, var_dtype, apply_state):
super(AdamWeightDecay, self)._prepare_local(var_device, var_dtype, # pytype: disable=attribute-error # typed-keras
apply_state)
apply_state[(var_device, var_dtype)]['weight_decay_rate'] = tf.constant(
self.weight_decay_rate, name='adam_weight_decay_rate')
def _decay_weights_op(self, var, learning_rate, apply_state):
do_decay = self._do_use_weight_decay(var.name)
if do_decay:
return var.assign_sub(
learning_rate * var *
apply_state[(var.device, var.dtype.base_dtype)]['weight_decay_rate'],
use_locking=self._use_locking)
return tf.no_op()
def apply_gradients(self,
grads_and_vars,
name=None,
experimental_aggregate_gradients=True):
grads, tvars = list(zip(*grads_and_vars))
if experimental_aggregate_gradients and self.gradient_clip_norm > 0.0:
# When experimental_aggregate_gradients = False, apply_gradients() no
# longer implicitly allreduces gradients; users manually allreduce gradients
# and pass the allreduced grads_and_vars. For now, the
# clip_by_global_norm will be moved to before the explicit allreduce to
# keep the math the same as the TF 1 and pre-TF 2.2 implementations.
(grads, _) = tf.clip_by_global_norm(
grads, clip_norm=self.gradient_clip_norm)
return super(AdamWeightDecay, self).apply_gradients(
zip(grads, tvars),
name=name,
experimental_aggregate_gradients=experimental_aggregate_gradients)
def _get_lr(self, var_device, var_dtype, apply_state):
"""Retrieves the learning rate with the given state."""
if apply_state is None:
return self._decayed_lr_t[var_dtype], {}
apply_state = apply_state or {}
coefficients = apply_state.get((var_device, var_dtype))
if coefficients is None:
coefficients = self._fallback_apply_state(var_device, var_dtype)
apply_state[(var_device, var_dtype)] = coefficients
return coefficients['lr_t'], dict(apply_state=apply_state)
def _resource_apply_dense(self, grad, var, apply_state=None):
lr_t, kwargs = self._get_lr(var.device, var.dtype.base_dtype, apply_state)
decay = self._decay_weights_op(var, lr_t, apply_state)
with tf.control_dependencies([decay]):
return super(AdamWeightDecay,
self)._resource_apply_dense(grad, var, **kwargs) # pytype: disable=attribute-error # typed-keras
def _resource_apply_sparse(self, grad, var, indices, apply_state=None):
lr_t, kwargs = self._get_lr(var.device, var.dtype.base_dtype, apply_state)
decay = self._decay_weights_op(var, lr_t, apply_state)
with tf.control_dependencies([decay]):
return super(AdamWeightDecay,
self)._resource_apply_sparse(grad, var, indices, **kwargs) # pytype: disable=attribute-error # typed-keras
def get_config(self):
config = super(AdamWeightDecay, self).get_config()
config.update({
'weight_decay_rate': self.weight_decay_rate,
})
return config
def _do_use_weight_decay(self, param_name):
"""Whether to use L2 weight decay for `param_name`."""
if self.weight_decay_rate == 0:
return False
if self._include_in_weight_decay:
for r in self._include_in_weight_decay:
if re.search(r, param_name) is not None:
return True
if self._exclude_from_weight_decay:
for r in self._exclude_from_weight_decay:
if re.search(r, param_name) is not None:
return False
return True
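For reference, a minimal usage sketch of the class defined above (hyperparameter values are illustrative, and the import assumes the Model Garden package is available): weight decay is applied directly to the variables, with LayerNorm and bias parameters excluded as in the original BERT recipe.

```python
import tensorflow as tf
from official.modeling.optimization import legacy_adamw

# Illustrative hyperparameters; decay weights directly, excluding
# LayerNorm and bias variables as in the original BERT setup.
optimizer = legacy_adamw.AdamWeightDecay(
    learning_rate=1e-4,
    weight_decay_rate=0.01,
    exclude_from_weight_decay=['LayerNorm', 'layer_norm', 'bias'],
    gradient_clip_norm=1.0)

var = tf.Variable([1.0, 2.0], name='dense/kernel')
with tf.GradientTape() as tape:
  loss = tf.reduce_sum(var ** 2)
grads = tape.gradient(loss, [var])
optimizer.apply_gradients(zip(grads, [var]))
print(var.numpy())  # updated by Adam and decayed by weight_decay_rate * lr
```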
......@@ -18,20 +18,21 @@ from typing import Callable, Optional, Union, List, Tuple
import gin
import tensorflow as tf
import tensorflow_addons.optimizers as tfa_optimizers
from official.modeling.optimization import slide_optimizer
from official.modeling.optimization import adafactor_optimizer
from official.modeling.optimization import ema_optimizer
from official.modeling.optimization import lars_optimizer
from official.modeling.optimization import legacy_adamw
from official.modeling.optimization import lr_schedule
from official.modeling.optimization.configs import optimization_config as opt_cfg
from official.nlp import optimization as nlp_optimization
OPTIMIZERS_CLS = {
'sgd': tf.keras.optimizers.SGD,
# TODO(chenmoneygithub): experimental.SGD
'adam': tf.keras.optimizers.Adam,
# TODO(chenmoneygithub): experimental.Adam
'adamw': nlp_optimization.AdamWeightDecay,
'adamw': legacy_adamw.AdamWeightDecay,
'lamb': tfa_optimizers.LAMB,
'rmsprop': tf.keras.optimizers.RMSprop,
'lars': lars_optimizer.LARS,
......@@ -57,8 +58,8 @@ WARMUP_CLS = {
}
def register_optimizer_cls(
key: str, optimizer_config_cls: tf.keras.optimizers.Optimizer):
def register_optimizer_cls(key: str,
optimizer_config_cls: tf.keras.optimizers.Optimizer):
"""Register customize optimizer cls.
The user will still need to subclass data classes in
......@@ -85,6 +86,8 @@ class OptimizerFactory:
(4) Build optimizer.
This is a typical example for using this class:
```
params = {
'optimizer': {
'type': 'sgd',
......@@ -104,6 +107,7 @@ class OptimizerFactory:
opt_factory = OptimizerFactory(opt_config)
lr = opt_factory.build_learning_rate()
optimizer = opt_factory.build_optimizer(lr)
```
"""
def __init__(self, config: opt_cfg.OptimizationConfig):
......@@ -156,9 +160,12 @@ class OptimizerFactory:
def build_optimizer(
self,
lr: Union[tf.keras.optimizers.schedules.LearningRateSchedule, float],
gradient_aggregator: Optional[Callable[
[List[Tuple[tf.Tensor, tf.Tensor]]], List[Tuple[tf.Tensor,
tf.Tensor]]]] = None,
gradient_transformers: Optional[List[Callable[
[List[Tuple[tf.Tensor, tf.Tensor]]], List[Tuple[tf.Tensor, tf.Tensor]]
]]] = None,
[List[Tuple[tf.Tensor, tf.Tensor]]], List[Tuple[tf.Tensor,
tf.Tensor]]]]] = None,
postprocessor: Optional[Callable[[tf.keras.optimizers.Optimizer],
tf.keras.optimizers.Optimizer]] = None):
"""Build optimizer.
......@@ -170,6 +177,7 @@ class OptimizerFactory:
Args:
lr: A floating point value, or a
tf.keras.optimizers.schedules.LearningRateSchedule instance.
gradient_aggregator: Optional function to overwrite gradient aggregation.
gradient_transformers: Optional list of functions to use to transform
gradients before applying updates to Variables. The functions are
applied after gradient_aggregator. The functions should accept and
......@@ -193,6 +201,8 @@ class OptimizerFactory:
del optimizer_dict['global_clipnorm']
optimizer_dict['learning_rate'] = lr
if gradient_aggregator is not None:
optimizer_dict['gradient_aggregator'] = gradient_aggregator
if gradient_transformers is not None:
optimizer_dict['gradient_transformers'] = gradient_transformers
......
......@@ -49,6 +49,39 @@ class OptimizerFactoryTest(tf.test.TestCase, parameterized.TestCase):
self.assertIsInstance(optimizer, optimizer_cls)
self.assertEqual(expected_optimizer_config, optimizer.get_config())
def test_gradient_aggregator(self):
params = {
'optimizer': {
'type': 'adam',
},
'learning_rate': {
'type': 'constant',
'constant': {
'learning_rate': 1.0
}
}
}
opt_config = optimization_config.OptimizationConfig(params)
opt_factory = optimizer_factory.OptimizerFactory(opt_config)
lr = opt_factory.build_learning_rate()
# Dummy function to zero out gradients.
zero_grads = lambda gv: [(tf.zeros_like(g), v) for g, v in gv]
optimizer = opt_factory.build_optimizer(lr, gradient_aggregator=zero_grads)
var0 = tf.Variable([1.0, 2.0])
var1 = tf.Variable([3.0, 4.0])
grads0 = tf.constant([1.0, 1.0])
grads1 = tf.constant([1.0, 1.0])
grads_and_vars = list(zip([grads0, grads1], [var0, var1]))
optimizer.apply_gradients(grads_and_vars)
self.assertAllClose(np.array([1.0, 2.0]), var0.numpy())
self.assertAllClose(np.array([3.0, 4.0]), var1.numpy())
@parameterized.parameters((None, None), (1.0, None), (None, 1.0))
def test_gradient_clipping(self, clipnorm, clipvalue):
params = {
......@@ -418,7 +451,7 @@ class OptimizerFactoryTest(tf.test.TestCase, parameterized.TestCase):
}
}
}
expected_lr_step_values = [[0, 0.0], [5000, 1e-4/2.0], [10000, 1e-4],
expected_lr_step_values = [[0, 0.0], [5000, 1e-4 / 2.0], [10000, 1e-4],
[20000, 9.994863e-05], [499999, 5e-05]]
opt_config = optimization_config.OptimizationConfig(params)
opt_factory = optimizer_factory.OptimizerFactory(opt_config)
......@@ -434,10 +467,12 @@ class OptimizerFactoryRegistryTest(tf.test.TestCase):
class MyClass():
pass
optimizer_factory.register_optimizer_cls('test', MyClass)
self.assertIn('test', optimizer_factory.OPTIMIZERS_CLS)
with self.assertRaisesRegex(ValueError, 'test already registered.*'):
optimizer_factory.register_optimizer_cls('test', MyClass)
if __name__ == '__main__':
tf.test.main()
# TensorFlow NLP Modelling Toolkit
# TF-NLP Model Garden
⚠️ Disclaimer: The datasets hyperlinked from this page are not owned or
distributed by Google; they are made available by third parties.
Please review the terms and conditions made available by the third parties
before using the data.
This codebase provides a Natural Language Processing modeling toolkit written in
[TF2](https://www.tensorflow.org/guide/effective_tf2). It allows researchers and
......@@ -30,7 +35,10 @@ research ideas. Detailed instructions can be found in READMEs in each folder.
We provide SoTA model implementations, pre-trained models, training and
evaluation examples, and command lines. Detailed instructions can be found in the
READMEs for specific papers.
READMEs for specific papers. Below are some of the papers implemented in this
repository; more NLP projects can be found in the
[`projects`](https://github.com/tensorflow/models/tree/master/official/projects)
folder:
1. [BERT](MODEL_GARDEN.md#available-model-configs): [BERT: Pre-training of Deep Bidirectional Transformers for
Language Understanding](https://arxiv.org/abs/1810.04805) by Devlin et al.,
......@@ -38,10 +46,10 @@ READMEs for specific papers.
2. [ALBERT](MODEL_GARDEN.md#available-model-configs):
[A Lite BERT for Self-supervised Learning of Language Representations](https://arxiv.org/abs/1909.11942)
by Lan et al., 2019
3. [XLNet](xlnet):
3. [XLNet](MODEL_GARDEN.md):
[XLNet: Generalized Autoregressive Pretraining for Language Understanding](https://arxiv.org/abs/1906.08237)
by Yang et al., 2019
4. [Transformer for translation](transformer):
4. [Transformer for translation](MODEL_GARDEN.md#available-model-configs):
[Attention Is All You Need](https://arxiv.org/abs/1706.03762) by Vaswani et
al., 2017
......
......@@ -17,4 +17,3 @@
from official.nlp.configs import finetuning_experiments
from official.nlp.configs import pretraining_experiments
from official.nlp.configs import wmt_transformer_experiments
from official.projects.teams import teams_experiments
......@@ -187,6 +187,8 @@ class AxProcessor(DataProcessor):
def _create_examples_tfds(self, dataset, set_type):
"""Creates examples for the training/dev/test sets."""
dataset = list(dataset)
dataset.sort(key=lambda x: x["idx"])
examples = []
for i, example in enumerate(dataset):
guid = "%s-%s" % (set_type, i)
......@@ -218,6 +220,8 @@ class ColaProcessor(DefaultGLUEDataProcessor):
"""Creates examples for the training/dev/test sets."""
dataset = tfds.load(
"glue/cola", split=set_type, try_gcs=True).as_numpy_iterator()
dataset = list(dataset)
dataset.sort(key=lambda x: x["idx"])
examples = []
for i, example in enumerate(dataset):
guid = "%s-%s" % (set_type, i)
......@@ -312,6 +316,8 @@ class MnliProcessor(DataProcessor):
"""Creates examples for the training/dev/test sets."""
dataset = tfds.load(
"glue/mnli", split=set_type, try_gcs=True).as_numpy_iterator()
dataset = list(dataset)
dataset.sort(key=lambda x: x["idx"])
examples = []
for i, example in enumerate(dataset):
guid = "%s-%s" % (set_type, i)
......@@ -343,6 +349,8 @@ class MrpcProcessor(DefaultGLUEDataProcessor):
"""Creates examples for the training/dev/test sets."""
dataset = tfds.load(
"glue/mrpc", split=set_type, try_gcs=True).as_numpy_iterator()
dataset = list(dataset)
dataset.sort(key=lambda x: x["idx"])
examples = []
for i, example in enumerate(dataset):
guid = "%s-%s" % (set_type, i)
......@@ -453,6 +461,8 @@ class QnliProcessor(DefaultGLUEDataProcessor):
"""Creates examples for the training/dev/test sets."""
dataset = tfds.load(
"glue/qnli", split=set_type, try_gcs=True).as_numpy_iterator()
dataset = list(dataset)
dataset.sort(key=lambda x: x["idx"])
examples = []
for i, example in enumerate(dataset):
guid = "%s-%s" % (set_type, i)
......@@ -484,6 +494,8 @@ class QqpProcessor(DefaultGLUEDataProcessor):
"""Creates examples for the training/dev/test sets."""
dataset = tfds.load(
"glue/qqp", split=set_type, try_gcs=True).as_numpy_iterator()
dataset = list(dataset)
dataset.sort(key=lambda x: x["idx"])
examples = []
for i, example in enumerate(dataset):
guid = "%s-%s" % (set_type, i)
......@@ -517,6 +529,8 @@ class RteProcessor(DefaultGLUEDataProcessor):
"""Creates examples for the training/dev/test sets."""
dataset = tfds.load(
"glue/rte", split=set_type, try_gcs=True).as_numpy_iterator()
dataset = list(dataset)
dataset.sort(key=lambda x: x["idx"])
examples = []
for i, example in enumerate(dataset):
guid = "%s-%s" % (set_type, i)
......@@ -548,6 +562,8 @@ class SstProcessor(DefaultGLUEDataProcessor):
"""Creates examples for the training/dev/test sets."""
dataset = tfds.load(
"glue/sst2", split=set_type, try_gcs=True).as_numpy_iterator()
dataset = list(dataset)
dataset.sort(key=lambda x: x["idx"])
examples = []
for i, example in enumerate(dataset):
guid = "%s-%s" % (set_type, i)
......@@ -574,6 +590,8 @@ class StsBProcessor(DefaultGLUEDataProcessor):
"""Creates examples for the training/dev/test sets."""
dataset = tfds.load(
"glue/stsb", split=set_type, try_gcs=True).as_numpy_iterator()
dataset = list(dataset)
dataset.sort(key=lambda x: x["idx"])
examples = []
for i, example in enumerate(dataset):
guid = "%s-%s" % (set_type, i)
......@@ -742,6 +760,8 @@ class WnliProcessor(DefaultGLUEDataProcessor):
"""Creates examples for the training/dev/test sets."""
dataset = tfds.load(
"glue/wnli", split=set_type, try_gcs=True).as_numpy_iterator()
dataset = list(dataset)
dataset.sort(key=lambda x: x["idx"])
examples = []
for i, example in enumerate(dataset):
guid = "%s-%s" % (set_type, i)
......
......@@ -178,13 +178,13 @@ class KernelAttention(tf.keras.layers.MultiHeadAttention):
is_short_seq=False,
begin_kernel=0,
scale=None,
scale_by_length=False,
**kwargs):
r"""Constructor of KernelAttention.
Args:
feature_transform: A non-linear transform of the keys and quries.
Possible transforms are "elu", "relu", "square", "exp", "expmod",
"identity".
feature_transform: A non-linear transform of the keys and queries. Possible
transforms are "elu", "relu", "square", "exp", "expmod", "identity".
num_random_features: Number of random features to be used for projection.
If num_random_features <= 0, no projection is used before the transform.
seed: The seed to begin drawing random features. Once the seed is set, the
......@@ -194,12 +194,16 @@ class KernelAttention(tf.keras.layers.MultiHeadAttention):
redraw: Whether to redraw projection every forward pass during training.
The argument is only effective when num_random_features > 0.
is_short_seq: boolean predicate indicating whether input data consists of
very short sequences or not; in most cases this should be False
(default option).
very short sequences or not; in most cases this should be False (default
option).
begin_kernel: Apply kernel_attention after this sequence id and apply
softmax attention before this.
scale: The value to scale the dot product as described in `Attention Is
All You Need`. If None, we use 1/sqrt(dk) as described in the paper.
scale_by_length: boolean predicate indicating whether to additionally scale
the dot product based on key length. Set to log_512(n) to stabilize
attention entropy against length. Refer to
https://kexue.fm/archives/8823 for details.
**kwargs: The same arguments `MultiHeadAttention` layer.
"""
if feature_transform not in _TRANSFORM_MAP:
......@@ -214,6 +218,7 @@ class KernelAttention(tf.keras.layers.MultiHeadAttention):
self._redraw = redraw
self._is_short_seq = is_short_seq
self._begin_kernel = begin_kernel
self._scale_by_length = scale_by_length
# We use the seed for two scenarios:
# 1. inference
# 2. no redraw
......@@ -252,9 +257,9 @@ class KernelAttention(tf.keras.layers.MultiHeadAttention):
is_short_seq: boolean predicate indicating whether input data consists of
short or long sequences; usually short sequence is defined as having
length L <= 1024.
attention_mask: a boolean mask of shape `[B, S]`, that prevents
attenting to masked positions. Note that the mask is only appied to
the keys. User may want to mask the output if query contains pads.
attention_mask: a boolean mask of shape `[B, S]`, that prevents attending
to masked positions. Note that the mask is only applied to the keys. User
may want to mask the output if query contains pads.
training: Python boolean indicating whether the layer should behave in
training mode (adding dropout) or in inference mode (doing nothing).
numeric_stabler: A scalar value added to avoid divide by 0.
......@@ -270,17 +275,23 @@ class KernelAttention(tf.keras.layers.MultiHeadAttention):
else:
projection_matrix = self._projection_matrix
if self._scale_by_length:
scale = tf.math.log(tf.reduce_sum(attention_mask,
axis=-1)) * self._scale / math.log(512)
scale = tf.reshape(scale, [-1, 1, 1, 1])
else:
scale = self._scale
if is_short_seq:
# Note: Applying scalar multiply at the smaller end of einsum improves
# XLA performance, but may introduce slight numeric differences in
# the Transformer attention head.
query = query * self._scale
query = query * scale
else:
# Note: we suspect splitting the scale to key, query yields smaller
# approximation variance when random projection is used.
# For simplicity, we also split when there's no random projection.
key *= math.sqrt(self._scale)
query *= math.sqrt(self._scale)
key *= tf.math.sqrt(scale)
query *= tf.math.sqrt(scale)
key = _TRANSFORM_MAP[feature_transform](key, projection_matrix)
query = _TRANSFORM_MAP[feature_transform](query, projection_matrix)
......@@ -330,9 +341,9 @@ class KernelAttention(tf.keras.layers.MultiHeadAttention):
value: Value `Tensor` of shape `[B, S, dim]`.
key: Optional key `Tensor` of shape `[B, S, dim]`. If not given, will use
`value` for both `key` and `value`, which is the most common case.
attention_mask: a boolean mask of shape `[B, S]`, that prevents
attenting to masked positions. Note that the mask is only appied to
the keys. User may want to mask the output if query contains pads.
attention_mask: a boolean mask of shape `[B, S]`, that prevents attending
to masked positions. Note that the mask is only applied to the keys. User
may want to mask the output if query contains pads.
training: Python boolean indicating whether the layer should behave in
training mode (adding dropout) or in inference mode (doing nothing).
......@@ -373,9 +384,10 @@ class KernelAttention(tf.keras.layers.MultiHeadAttention):
attention_output = tf.concat(
[attention_output_softmax, attention_output_kernel], axis=1)
else:
attention_output = self._compute_attention(
query, key, value, self._feature_transform,
self._is_short_seq, attention_mask, training)
attention_output = self._compute_attention(query, key, value,
self._feature_transform,
self._is_short_seq,
attention_mask, training)
# This is actually dropping out entire tokens to attend to, which might
# seem a bit unusual, but is taken from the original Transformer paper.
attention_output = self._dropout_layer(attention_output)
......
......@@ -30,9 +30,9 @@ _BEGIN_KERNEL = [0, 512]
class KernelAttentionTest(tf.test.TestCase, parameterized.TestCase):
@parameterized.parameters(itertools.product(
_FEATURE_TRANSFORM, [127], _TRAINING, [True, False],
_IS_SHORT_SEQ, _BEGIN_KERNEL))
@parameterized.parameters(
itertools.product(_FEATURE_TRANSFORM, [127], _TRAINING, [True, False],
_IS_SHORT_SEQ, _BEGIN_KERNEL))
def test_attention_projection(
self, feature_transform, num_random_features, training, redraw, is_short,
begin_kernel):
......@@ -90,6 +90,32 @@ class KernelAttentionTest(tf.test.TestCase, parameterized.TestCase):
training=training)
self.assertEqual(output.shape, [batch_size, seq_length, key_dim])
@parameterized.parameters([128, 512])
def test_attention_scale_by_length(self, seq_length):
num_heads = 12
key_dim = 64
batch_size = 2
test_layer = attention.KernelAttention(
num_heads=num_heads,
key_dim=key_dim,
num_random_features=0,
scale_by_length=True)
query = tf.random.normal(
shape=(batch_size, seq_length, key_dim))
value = query
encoder_inputs_mask = tf.ones((batch_size, seq_length), dtype=tf.int32)
masks = tf.cast(encoder_inputs_mask, dtype=tf.float32)
output_scale_by_length = test_layer(
query=query, value=value, attention_mask=masks)
test_layer._scale_by_length = False
output_no_scale_by_length = test_layer(
query=query, value=value, attention_mask=masks)
if seq_length == 512: # Equals because log(seq_length, base=512) = 1.0
self.assertAllClose(output_scale_by_length, output_no_scale_by_length)
else:
self.assertNotAllClose(output_scale_by_length, output_no_scale_by_length)
def test_unsupported_feature_transform(self):
with self.assertRaisesRegex(ValueError, 'Unsupported feature_transform.*'):
_ = attention.KernelAttention(feature_transform='test')
......
......@@ -14,6 +14,7 @@
"""Keras-based TransformerEncoder block layer."""
from absl import logging
import tensorflow as tf
from official.nlp.modeling.layers import util
......@@ -176,9 +177,9 @@ class TransformerEncoderBlock(tf.keras.layers.Layer):
einsum_equation = "...bc,cd->...bd"
hidden_size = input_tensor_shape[-1]
if hidden_size % self._num_heads != 0:
raise ValueError(
logging.warning(
"The input size (%d) is not a multiple of the number of attention "
"heads (%d)" % (hidden_size, self._num_heads))
"heads (%d)", hidden_size, self._num_heads)
if self._key_dim is None:
self._key_dim = int(hidden_size // self._num_heads)
if self._output_last_dim is None:
......