Unverified Commit 44f6d511 authored by Srihari Humbarwadi, committed by GitHub

Merge branch 'tensorflow:master' into panoptic-deeplab

parents 686a287d 8bc5a1a5
...@@ -3,7 +3,8 @@
</div>
[![Python](https://img.shields.io/pypi/pyversions/tensorflow.svg?style=plastic)](https://badge.fury.io/py/tensorflow)
[![tf-models-official PyPI](https://badge.fury.io/py/tf-models-official.svg)](https://badge.fury.io/py/tf-models-official)
# Welcome to the Model Garden for TensorFlow
...@@ -32,7 +33,8 @@ To install the current release of tensorflow-models, please follow any one of th
<details>
**tf-models-official** is the stable Model Garden package. Please check out the [releases](https://github.com/tensorflow/models/releases) to see which modules are available.
pip will install all models and dependencies automatically.
```shell
......
...@@ -19,7 +19,7 @@ This repository provides a curated list of the GitHub repositories with machine
| [ResNet 101](https://github.com/IntelAI/models/tree/master/benchmarks/image_recognition/tensorflow/resnet101) | [Deep Residual Learning for Image Recognition](https://arxiv.org/pdf/1512.03385) | • Int8 Inference<br/>• FP32 Inference | [Intel](https://github.com/IntelAI) |
| [ResNet 50](https://github.com/IntelAI/models/tree/master/benchmarks/image_recognition/tensorflow/resnet50) | [Deep Residual Learning for Image Recognition](https://arxiv.org/pdf/1512.03385) | • Int8 Inference<br/>• FP32 Inference | [Intel](https://github.com/IntelAI) |
| [ResNet 50v1.5](https://github.com/IntelAI/models/tree/master/benchmarks/image_recognition/tensorflow/resnet50v1_5) | [Deep Residual Learning for Image Recognition](https://arxiv.org/pdf/1512.03385) | • Int8 Inference<br/>• FP32 Inference<br/>• FP32 Training | [Intel](https://github.com/IntelAI) |
| EfficientNet [v1](https://github.com/NVIDIA/DeepLearningExamples/tree/master/TensorFlow2/Classification/ConvNets/efficientnet_v1) [v2](https://github.com/NVIDIA/DeepLearningExamples/tree/master/TensorFlow2/Classification/ConvNets/efficientnet_v2) | [EfficientNet: Rethinking Model Scaling for Convolutional Neural Networks](https://arxiv.org/pdf/1905.11946.pdf) | • Automatic mixed precision<br/>• Horovod Multi-GPU training (NCCL)<br/>• Multi-node training on a Pyxis/Enroot Slurm cluster<br/>• XLA | [NVIDIA](https://github.com/NVIDIA) |
### Object Detection
......
...@@ -38,16 +38,15 @@ In the near future, we will add:
## Models and Implementations
### [Computer Vision](vision/README.md)
#### Image Classification
| Model | Reference (Paper) |
|-------|-------------------|
| [MNIST](legacy/image_classification) | A basic model to classify digits from the [MNIST dataset](http://yann.lecun.com/exdb/mnist/) |
| [ResNet](vision/MODEL_GARDEN.md) | [Deep Residual Learning for Image Recognition](https://arxiv.org/abs/1512.03385) |
| [ResNet-RS](vision/MODEL_GARDEN.md) | [Revisiting ResNets: Improved Training and Scaling Strategies](https://arxiv.org/abs/2103.07579) |
| [EfficientNet](vision/MODEL_GARDEN.md) | [EfficientNet: Rethinking Model Scaling for Convolutional Neural Networks](https://arxiv.org/abs/1905.11946) |
| [Vision Transformer](vision/MODEL_GARDEN.md) | [An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale](https://arxiv.org/abs/2010.11929) |
#### Object Detection and Segmentation
...@@ -56,7 +55,6 @@ In the near future, we will add:
| Model | Reference (Paper) |
|-------|-------------------|
| [RetinaNet](vision/MODEL_GARDEN.md) | [Focal Loss for Dense Object Detection](https://arxiv.org/abs/1708.02002) |
| [Mask R-CNN](vision/MODEL_GARDEN.md) | [Mask R-CNN](https://arxiv.org/abs/1703.06870) |
| [ShapeMask](legacy/detection) | [ShapeMask: Learning to Segment Novel Objects by Refining Shape Priors](https://arxiv.org/abs/1904.03239) |
| [SpineNet](vision/MODEL_GARDEN.md) | [SpineNet: Learning Scale-Permuted Backbone for Recognition and Localization](https://arxiv.org/abs/1912.05027) |
| [Cascade RCNN-RS and RetinaNet-RS](vision/MODEL_GARDEN.md) | [Simple Training Strategies and Model Scaling for Object Detection](https://arxiv.org/abs/2107.00057)|
...@@ -66,7 +64,7 @@ In the near future, we will add:
| Model | Reference (Paper) |
|-------|-------------------|
| [Mobile Video Networks (MoViNets)](projects/movinet) | [MoViNets: Mobile Video Networks for Efficient Video Recognition](https://arxiv.org/abs/2103.11511) |
### [Natural Language Processing](nlp/README.md)
| Model | Reference (Paper) |
|-------|-------------------|
...@@ -74,7 +72,6 @@ In the near future, we will add:
| [BERT (Bidirectional Encoder Representations from Transformers)](nlp/MODEL_GARDEN.md#available-model-configs) | [BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding](https://arxiv.org/abs/1810.04805) |
| [NHNet (News Headline generation model)](projects/nhnet) | [Generating Representative Headlines for News Stories](https://arxiv.org/abs/2001.09386) |
| [Transformer](nlp/MODEL_GARDEN.md#available-model-configs) | [Attention Is All You Need](https://arxiv.org/abs/1706.03762) |
| [XLNet](nlp/xlnet) | [XLNet: Generalized Autoregressive Pretraining for Language Understanding](https://arxiv.org/abs/1906.08237) |
| [MobileBERT](projects/mobilebert) | [MobileBERT: a Compact Task-Agnostic BERT for Resource-Limited Devices](https://arxiv.org/abs/2004.02984) |
### Recommendation
......
...@@ -34,14 +34,10 @@
{
"cell_type": "markdown",
"metadata": {
"id": "2X-XaMSVcLua"
},
"source": [
"# Decoding API"
]
},
{
...@@ -66,6 +62,30 @@
"\u003c/table\u003e"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "fsACVQpVSifi"
},
"source": [
"### Install the TensorFlow Model Garden pip package\n",
"\n",
"* `tf-models-official` is the stable Model Garden package. Note that it may not include the latest changes in the `tensorflow_models` github repo. To include latest changes, you may install `tf-models-nightly`,\n",
"which is the nightly Model Garden package created daily automatically.\n",
"* pip will install all models and dependencies automatically."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"id": "G4BhAu01HZcM"
},
"outputs": [],
"source": [
"!pip uninstall -y opencv-python"
]
},
{
"cell_type": "code",
"execution_count": null,
...@@ -74,7 +94,7 @@
},
"outputs": [],
"source": [
"!pip install tf-models-official"
]
},
{
...@@ -92,9 +112,20 @@
"\n",
"import tensorflow as tf\n",
"\n",
"from tensorflow_models import nlp"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"id": "T92ccAzlnGqh"
},
"outputs": [],
"source": [
"def length_norm(length, dtype):\n",
" \"\"\"Return length normalization factor.\"\"\"\n",
" return tf.pow(((5. + tf.cast(length, dtype)) / 6.), 0.0)"
]
},
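The `length_norm` closure above raises the GNMT-style length penalty `((5 + length) / 6) ** alpha` to the power `0.0`, so it always returns `1.0`, i.e. length normalization is effectively disabled. A plain-Python sketch (the `alpha` parameter here is our illustrative addition, not part of the notebook's closure) makes this visible:

```python
def length_norm(length, alpha=0.0):
    # GNMT-style length penalty; alpha=0.0 reproduces the notebook's
    # closure, which always yields 1.0 (normalization disabled).
    return ((5.0 + length) / 6.0) ** alpha

print(length_norm(4))                       # 1.0: no penalty applied
print(round(length_norm(4, alpha=0.6), 4))  # 1.2754: penalty grows with length
```

Raising `alpha` above zero makes the normalization factor grow with sequence length, which counteracts beam search's bias toward short hypotheses.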
{
...@@ -103,7 +134,8 @@
"id": "0AWgyo-IQ5sP"
},
"source": [
"## Overview\n",
"\n",
"This API provides an interface to experiment with different decoding strategies used for auto-regressive models.\n",
"\n",
"1. The following sampling strategies are provided in sampling_module.py, which inherits from the base Decoding class:\n",
...@@ -182,7 +214,7 @@
"id": "lV1RRp6ihnGX"
},
"source": [
"## Initialize the Model Hyper-parameters"
]
},
{
...@@ -193,44 +225,32 @@
},
"outputs": [],
"source": [
"params = {\n",
"    'num_heads': 2,\n",
"    'num_layers': 2,\n",
"    'batch_size': 2,\n",
"    'n_dims': 256,\n",
"    'max_decode_length': 4}"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "CYXkoplAij01"
},
"source": [
"## Initialize cache"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "UGvmd0_dRFYI"
},
"source": [
"In auto-regressive architectures like Transformer-based [Encoder-Decoder](https://arxiv.org/abs/1706.03762) models,\n",
"the cache is used for fast sequential decoding.\n",
"It is a nested dictionary storing pre-computed hidden states (keys and values in the self-attention blocks and in the cross-attention blocks) for every layer."
]
},
{
...@@ -243,35 +263,15 @@
"source": [
"cache = {\n",
"    'layer_%d' % layer: {\n",
"        'k': tf.zeros(\n",
"            shape=[params['batch_size'], params['max_decode_length'], params['num_heads'], params['n_dims'] // params['num_heads']],\n",
"            dtype=tf.float32),\n",
"        'v': tf.zeros(\n",
"            shape=[params['batch_size'], params['max_decode_length'], params['num_heads'], params['n_dims'] // params['num_heads']],\n",
"            dtype=tf.float32)\n",
"    } for layer in range(params['num_layers'])\n",
"}\n",
"print(\"cache key shape for layer 1 :\", cache['layer_1']['k'].shape)"
]
},
{
...@@ -280,15 +280,14 @@
"id": "syl7I5nURPgW"
},
"source": [
"### Create model_fn\n",
" In practice, this will be replaced by an actual model implementation such as [here](https://github.com/tensorflow/models/blob/master/official/nlp/transformer/transformer.py#L236)\n",
"```\n",
"Args:\n",
"i : Step that is being decoded.\n",
"Returns:\n",
"  logit probabilities of size [batch_size, 1, vocab_size]\n",
"```\n"
]
},
{
...@@ -307,15 +306,6 @@
"  return probabilities[:, i, :]"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "DBMUkaVmVZBg"
},
"source": [
"# Initialize symbols_to_logits_fn\n"
]
},
{
"cell_type": "code",
"execution_count": null,
...@@ -339,7 +329,7 @@
"id": "R_tV3jyWVL47"
},
"source": [
"## Greedy\n",
"Greedy decoding selects the token id with the highest probability as its next id: $id_t = argmax_{id}P(id | id_{1:t-1})$ at each timestep $t$. The following sketch shows greedy decoding."
]
},
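The greedy rule described above can be sketched in a few lines of NumPy, independent of the notebook's `sampling_module` (the `greedy_step` helper is our illustrative name, not part of that API):

```python
import numpy as np

def greedy_step(logits):
    """Select the highest-probability token id for each batch element."""
    return np.argmax(logits, axis=-1)

# Toy batch of 2 sequences over a 4-token vocabulary.
logits = np.array([[0.1, 2.0, 0.3, 0.4],
                   [1.5, 0.2, 0.1, 0.2]])
print(greedy_step(logits))  # [1 0]
```

Because the argmax is deterministic, greedy decoding always returns the same sequence for the same model and input.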
...@@ -370,7 +360,7 @@
"id": "s4pTTsQXVz5O"
},
"source": [
"## top_k sampling\n",
"In *Top-K* sampling, the *K* most likely next token ids are filtered and the probability mass is redistributed among only those *K* ids."
]
},
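The filtering step described above can be sketched in plain NumPy (a minimal stand-in for what `sampling_module` does internally; `top_k_filter` is our illustrative name):

```python
import numpy as np

def top_k_filter(logits, k):
    """Keep the k largest logits and mask the rest to -inf, so a
    subsequent softmax redistributes all mass among the top k ids."""
    threshold = np.sort(logits)[-k]
    return np.where(logits < threshold, -np.inf, logits)

logits = np.array([3.0, 1.0, 0.5, 2.0])
filtered = top_k_filter(logits, k=2)
# Softmax over the filtered logits: masked ids get exactly zero probability.
probs = np.exp(filtered - filtered.max())
probs /= probs.sum()
print(probs.round(3))  # only ids 0 and 3 keep nonzero probability
```

Sampling then draws from `probs`, so low-probability tail tokens can never be selected.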
...@@ -404,7 +394,7 @@
"id": "Jp3G-eE_WI4Y"
},
"source": [
"## top_p sampling\n",
"Instead of sampling only from the most likely *K* token ids, *Top-p* sampling chooses from the smallest possible set of ids whose cumulative probability exceeds the probability *p*."
]
},
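The nucleus-selection step described above can be sketched with NumPy (an illustrative helper under our own name, not the `sampling_module` implementation):

```python
import numpy as np

def top_p_ids(probs, p):
    """Smallest set of token ids whose cumulative probability exceeds p."""
    order = np.argsort(probs)[::-1]       # most likely ids first
    cum = np.cumsum(probs[order])
    cutoff = int(np.argmax(cum > p)) + 1  # first position where cum exceeds p
    return order[:cutoff]

probs = np.array([0.5, 0.3, 0.15, 0.05])
print(top_p_ids(probs, p=0.7))  # [0 1], since 0.5 + 0.3 = 0.8 > 0.7
```

Unlike Top-K, the number of candidate ids adapts to the shape of the distribution: a peaked distribution yields a small nucleus, a flat one a large nucleus.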
...@@ -438,7 +428,7 @@
"id": "2hcuyJ2VWjDz"
},
"source": [
"## Beam search decoding\n",
"Beam search reduces the risk of missing hidden high-probability token ids by keeping the most likely num_beams hypotheses at each time step and eventually choosing the hypothesis that has the overall highest probability."
]
},
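The expand-then-prune loop described above can be sketched in pure Python. This is a toy under a strong simplifying assumption: the per-step log-probabilities are a fixed table, whereas a real model conditions each step on the prefix (the `beam_search` name and signature are ours, not the library's):

```python
import numpy as np

def beam_search(step_log_probs, num_beams):
    """Expand every beam by every token, keep the num_beams best.

    step_log_probs: [num_steps, vocab_size] table of log-probs,
    assumed independent of the decoded prefix (toy setting only).
    """
    beams = [((), 0.0)]  # (token ids, cumulative log-prob)
    for log_probs in step_log_probs:
        candidates = [(ids + (tok,), score + lp)
                      for ids, score in beams
                      for tok, lp in enumerate(log_probs)]
        # Prune: keep only the num_beams highest-scoring hypotheses.
        beams = sorted(candidates, key=lambda b: b[1], reverse=True)[:num_beams]
    return beams[0][0]

table = np.log(np.array([[0.6, 0.4], [0.3, 0.7]]))
print(beam_search(table, num_beams=2))  # (0, 1)
```

Note that greedy decoding would also pick (0, 1) here; beam search differs when a locally suboptimal first token leads to a globally better sequence, which the pruning loop can still recover as long as that prefix survives in the beam.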
......
{
"nbformat": 4,
"nbformat_minor": 0,
"metadata": {
"colab": {
"name": "Customizing a Transformer Encoder",
"private_outputs": true,
"provenance": [],
"collapsed_sections": [],
"toc_visible": true
},
"kernelspec": {
"display_name": "Python 3",
"name": "python3"
}
},
"cells": [
{
"cell_type": "markdown",
...@@ -26,10 +11,12 @@
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"cellView": "form",
"id": "rxPj2Lsni9O4"
},
"outputs": [],
"source": [
"#@title Licensed under the Apache License, Version 2.0 (the \"License\");\n",
"# you may not use this file except in compliance with the License.\n",
...@@ -42,9 +29,7 @@
"# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.\n",
"# See the License for the specific language governing permissions and\n",
"# limitations under the License."
]
},
{
"cell_type": "markdown",
...@@ -61,20 +46,20 @@
"id": "Mwb9uw1cDXsa"
},
"source": [
"\u003ctable class=\"tfo-notebook-buttons\" align=\"left\"\u003e\n",
"  \u003ctd\u003e\n",
"    \u003ca target=\"_blank\" href=\"https://www.tensorflow.org/official_models/nlp/customize_encoder\"\u003e\u003cimg src=\"https://www.tensorflow.org/images/tf_logo_32px.png\" /\u003eView on TensorFlow.org\u003c/a\u003e\n",
"  \u003c/td\u003e\n",
"  \u003ctd\u003e\n",
"    \u003ca target=\"_blank\" href=\"https://colab.research.google.com/github/tensorflow/models/blob/master/official/colab/nlp/customize_encoder.ipynb\"\u003e\u003cimg src=\"https://www.tensorflow.org/images/colab_logo_32px.png\" /\u003eRun in Google Colab\u003c/a\u003e\n",
"  \u003c/td\u003e\n",
"  \u003ctd\u003e\n",
"    \u003ca target=\"_blank\" href=\"https://github.com/tensorflow/models/blob/master/official/colab/nlp/customize_encoder.ipynb\"\u003e\u003cimg src=\"https://www.tensorflow.org/images/GitHub-Mark-32px.png\" /\u003eView source on GitHub\u003c/a\u003e\n",
"  \u003c/td\u003e\n",
"  \u003ctd\u003e\n",
"    \u003ca href=\"https://storage.googleapis.com/tensorflow_docs/models/official/colab/nlp/customize_encoder.ipynb\"\u003e\u003cimg src=\"https://www.tensorflow.org/images/download_logo_32px.png\" /\u003eDownload notebook\u003c/a\u003e\n",
"  \u003c/td\u003e\n",
"\u003c/table\u003e"
]
},
{
...@@ -87,7 +72,7 @@
"\n",
"The [TensorFlow Models NLP library](https://github.com/tensorflow/models/tree/master/official/nlp/modeling) is a collection of tools for building and training modern high performance natural language models.\n",
"\n",
"The `tfm.nlp.networks.EncoderScaffold` is the core of this library, and many new network architectures have been proposed to improve the encoder. In this Colab notebook, we will learn how to customize the encoder to employ new network architectures."
]
},
{
...@@ -114,14 +99,27 @@
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"id": "mfHI5JyuJ1y9"
},
"outputs": [],
"source": [
"# Uninstall colab's opencv-python, it conflicts with `opencv-python-headless`\n",
"# which is installed by tf-models-official\n",
"!pip uninstall -y opencv-python"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"id": "thsKZDjhswhR"
},
"outputs": [],
"source": [
"!pip install -q tf-models-nightly"
]
},
{
"cell_type": "markdown",
...@@ -134,19 +132,18 @@
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"id": "my4dp-RMssQe"
},
"outputs": [],
"source": [
"import numpy as np\n",
"import tensorflow as tf\n",
"\n",
"import tensorflow_models as tfm\n",
"from tensorflow_models import nlp"
]
},
{
"cell_type": "markdown",
...@@ -156,14 +153,16 @@
"source": [
"## Canonical BERT encoder\n",
"\n",
"Before learning how to customize the encoder, let's first create a canonical BERT encoder and use it to instantiate a `bert_classifier.BertClassifier` for a classification task."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"id": "Oav8sbgstWc-"
},
"outputs": [],
"source": [
"cfg = {\n",
"    \"vocab_size\": 100,\n",
...@@ -171,22 +170,20 @@
"    \"num_layers\": 3,\n",
"    \"num_attention_heads\": 4,\n",
"    \"intermediate_size\": 64,\n",
"    \"activation\": tfm.utils.activations.gelu,\n",
"    \"dropout_rate\": 0.1,\n",
"    \"attention_dropout_rate\": 0.1,\n",
"    \"max_sequence_length\": 16,\n",
"    \"type_vocab_size\": 2,\n",
"    \"initializer\": tf.keras.initializers.TruncatedNormal(stddev=0.02),\n",
"}\n",
"bert_encoder = nlp.networks.BertEncoder(**cfg)\n",
"\n",
"def build_classifier(bert_encoder):\n",
"  return nlp.models.BertClassifier(bert_encoder, num_classes=2)\n",
"\n",
"canonical_classifier_model = build_classifier(bert_encoder)"
]
},
{
"cell_type": "markdown",
...@@ -201,9 +198,11 @@
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"id": "csED2d-Yt5h6"
},
"outputs": [],
"source": [
"def predict(model):\n",
"  batch_size = 3\n",
...@@ -216,9 +215,7 @@
"  print(model([word_ids, mask, type_ids], training=False))\n",
"\n",
"predict(canonical_classifier_model)"
]
},
{
"cell_type": "markdown",
...@@ -249,7 +246,7 @@
"source": [
"### Use EncoderScaffold\n",
"\n",
"`networks.EncoderScaffold` allows users to provide a custom embedding subnetwork\n",
" (which will replace the standard embedding logic) and/or a custom hidden layer class (which will replace the `Transformer` instantiation in the encoder)."
]
},
...@@ -261,30 +258,32 @@
"source": [
"#### Without Customization\n",
"\n",
"Without any customization, `networks.EncoderScaffold` behaves the same as the canonical `networks.BertEncoder`.\n",
"\n",
"As shown in the following example, `networks.EncoderScaffold` can load `networks.BertEncoder`'s weights and output the same values:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"id": "ktNzKuVByZQf"
},
"outputs": [],
"source": [
"default_hidden_cfg = dict(\n",
"    num_attention_heads=cfg[\"num_attention_heads\"],\n",
"    intermediate_size=cfg[\"intermediate_size\"],\n",
"    intermediate_activation=cfg[\"activation\"],\n",
"    dropout_rate=cfg[\"dropout_rate\"],\n",
"    attention_dropout_rate=cfg[\"attention_dropout_rate\"],\n",
"    kernel_initializer=cfg[\"initializer\"],\n",
")\n",
"default_embedding_cfg = dict(\n",
"    vocab_size=cfg[\"vocab_size\"],\n",
"    type_vocab_size=cfg[\"type_vocab_size\"],\n",
"    hidden_size=cfg[\"hidden_size\"],\n",
"    initializer=cfg[\"initializer\"],\n",
"    dropout_rate=cfg[\"dropout_rate\"],\n",
"    max_seq_length=cfg[\"max_sequence_length\"]\n",
")\n",
...@@ -294,17 +293,15 @@
"    num_hidden_instances=cfg[\"num_layers\"],\n",
"    pooled_output_dim=cfg[\"hidden_size\"],\n",
"    return_all_layer_outputs=True,\n",
"    pooler_layer_initializer=cfg[\"initializer\"],\n",
")\n",
"\n",
"encoder_scaffold = nlp.networks.EncoderScaffold(**default_kwargs)\n",
"classifier_model_from_encoder_scaffold = build_classifier(encoder_scaffold)\n",
"classifier_model_from_encoder_scaffold.set_weights(\n",
"    canonical_classifier_model.get_weights())\n",
"predict(classifier_model_from_encoder_scaffold)"
]
"execution_count": null,
"outputs": []
}, },
{ {
"cell_type": "markdown", "cell_type": "markdown",
...@@ -321,26 +318,26 @@ ...@@ -321,26 +318,26 @@
}, },
{ {
"cell_type": "code", "cell_type": "code",
"execution_count": null,
"metadata": { "metadata": {
"id": "LTinnaG6vcsw" "id": "LTinnaG6vcsw"
}, },
"outputs": [],
"source": [ "source": [
"word_ids = tf.keras.layers.Input(\n", "word_ids = tf.keras.layers.Input(\n",
" shape=(cfg['max_sequence_length'],), dtype=tf.int32, name=\"input_word_ids\")\n", " shape=(cfg['max_sequence_length'],), dtype=tf.int32, name=\"input_word_ids\")\n",
"mask = tf.keras.layers.Input(\n", "mask = tf.keras.layers.Input(\n",
" shape=(cfg['max_sequence_length'],), dtype=tf.int32, name=\"input_mask\")\n", " shape=(cfg['max_sequence_length'],), dtype=tf.int32, name=\"input_mask\")\n",
"embedding_layer = modeling.layers.OnDeviceEmbedding(\n", "embedding_layer = nlp.layers.OnDeviceEmbedding(\n",
" vocab_size=cfg['vocab_size'],\n", " vocab_size=cfg['vocab_size'],\n",
" embedding_width=cfg['hidden_size'],\n", " embedding_width=cfg['hidden_size'],\n",
" initializer=tf.keras.initializers.TruncatedNormal(stddev=0.02),\n", " initializer=cfg[\"initializer\"],\n",
" name=\"word_embeddings\")\n", " name=\"word_embeddings\")\n",
"word_embeddings = embedding_layer(word_ids)\n", "word_embeddings = embedding_layer(word_ids)\n",
"attention_mask = layers.SelfAttentionMask()([word_embeddings, mask])\n", "attention_mask = nlp.layers.SelfAttentionMask()([word_embeddings, mask])\n",
"new_embedding_network = tf.keras.Model([word_ids, mask],\n", "new_embedding_network = tf.keras.Model([word_ids, mask],\n",
" [word_embeddings, attention_mask])" " [word_embeddings, attention_mask])"
], ]
"execution_count": null,
"outputs": []
}, },
{ {
"cell_type": "markdown", "cell_type": "markdown",
...@@ -354,14 +351,14 @@ ...@@ -354,14 +351,14 @@
}, },
{ {
"cell_type": "code", "cell_type": "code",
"execution_count": null,
"metadata": { "metadata": {
"id": "fO9zKFE4OpHp" "id": "fO9zKFE4OpHp"
}, },
"outputs": [],
"source": [ "source": [
"tf.keras.utils.plot_model(new_embedding_network, show_shapes=True, dpi=48)" "tf.keras.utils.plot_model(new_embedding_network, show_shapes=True, dpi=48)"
], ]
"execution_count": null,
"outputs": []
}, },
{ {
"cell_type": "markdown", "cell_type": "markdown",
...@@ -374,9 +371,11 @@ ...@@ -374,9 +371,11 @@
}, },
{ {
"cell_type": "code", "cell_type": "code",
"execution_count": null,
"metadata": { "metadata": {
"id": "mtFDMNf2vIl9" "id": "mtFDMNf2vIl9"
}, },
"outputs": [],
"source": [ "source": [
"kwargs = dict(default_kwargs)\n", "kwargs = dict(default_kwargs)\n",
"\n", "\n",
...@@ -384,16 +383,14 @@ ...@@ -384,16 +383,14 @@
"kwargs['embedding_cls'] = new_embedding_network\n", "kwargs['embedding_cls'] = new_embedding_network\n",
"kwargs['embedding_data'] = embedding_layer.embeddings\n", "kwargs['embedding_data'] = embedding_layer.embeddings\n",
"\n", "\n",
"encoder_with_customized_embedding = modeling.networks.EncoderScaffold(**kwargs)\n", "encoder_with_customized_embedding = nlp.networks.EncoderScaffold(**kwargs)\n",
"classifier_model = build_classifier(encoder_with_customized_embedding)\n", "classifier_model = build_classifier(encoder_with_customized_embedding)\n",
"# ... Train the model ...\n", "# ... Train the model ...\n",
"print(classifier_model.inputs)\n", "print(classifier_model.inputs)\n",
"\n", "\n",
"# Assert that there are only two inputs.\n", "# Assert that there are only two inputs.\n",
"assert len(classifier_model.inputs) == 2" "assert len(classifier_model.inputs) == 2"
], ]
"execution_count": null,
"outputs": []
}, },
{ {
"cell_type": "markdown", "cell_type": "markdown",
...@@ -403,34 +400,34 @@ ...@@ -403,34 +400,34 @@
"source": [ "source": [
"#### Customized Transformer\n", "#### Customized Transformer\n",
"\n", "\n",
"Users can also override the [hidden_cls](https://github.com/tensorflow/models/blob/master/official/nlp/modeling/networks/encoder_scaffold.py#L103) argument in `EncoderScaffold`'s constructor to employ a customized Transformer layer.\n", "Users can also override the `hidden_cls` argument in `networks.EncoderScaffold`'s constructor to employ a customized Transformer layer.\n",
"\n", "\n",
"See [ReZeroTransformer](https://github.com/tensorflow/models/blob/master/official/nlp/modeling/layers/rezero_transformer.py) for how to implement a customized Transformer layer.\n", "See [the source of `nlp.layers.ReZeroTransformer`](https://github.com/tensorflow/models/blob/master/official/nlp/modeling/layers/rezero_transformer.py) for how to implement a customized Transformer layer.\n",
"\n", "\n",
"Following is an example of using `ReZeroTransformer`:\n" "Following is an example of using `nlp.layers.ReZeroTransformer`:\n"
] ]
}, },
{ {
"cell_type": "code", "cell_type": "code",
"execution_count": null,
"metadata": { "metadata": {
"id": "uAIarLZgw6pA" "id": "uAIarLZgw6pA"
}, },
"outputs": [],
"source": [ "source": [
"kwargs = dict(default_kwargs)\n", "kwargs = dict(default_kwargs)\n",
"\n", "\n",
"# Use ReZeroTransformer.\n", "# Use ReZeroTransformer.\n",
"kwargs['hidden_cls'] = modeling.layers.ReZeroTransformer\n", "kwargs['hidden_cls'] = nlp.layers.ReZeroTransformer\n",
"\n", "\n",
"encoder_with_rezero_transformer = modeling.networks.EncoderScaffold(**kwargs)\n", "encoder_with_rezero_transformer = nlp.networks.EncoderScaffold(**kwargs)\n",
"classifier_model = build_classifier(encoder_with_rezero_transformer)\n", "classifier_model = build_classifier(encoder_with_rezero_transformer)\n",
"# ... Train the model ...\n", "# ... Train the model ...\n",
"predict(classifier_model)\n", "predict(classifier_model)\n",
"\n", "\n",
"# Assert that the variable `rezero_alpha` from ReZeroTransformer exists.\n", "# Assert that the variable `rezero_alpha` from ReZeroTransformer exists.\n",
"assert 'rezero_alpha' in ''.join([x.name for x in classifier_model.trainable_weights])" "assert 'rezero_alpha' in ''.join([x.name for x in classifier_model.trainable_weights])"
], ]
"execution_count": null,
"outputs": []
}, },
{ {
"cell_type": "markdown", "cell_type": "markdown",
...@@ -438,10 +435,9 @@ ...@@ -438,10 +435,9 @@
"id": "6PMHFdvnxvR0" "id": "6PMHFdvnxvR0"
}, },
"source": [ "source": [
"### Use [TransformerScaffold](https://github.com/tensorflow/models/blob/master/official/nlp/modeling/layers/transformer_scaffold.py)\n", "### Use `nlp.layers.TransformerScaffold`\n",
"\n", "\n",
"The above method of customizing `Transformer` requires rewriting the whole `Transformer` layer, while sometimes you may only want to customize either the attention layer or the feedforward block. In this case, [TransformerScaffold](https://github.com/tensorflow/models/blob/master/official/nlp/modeling/layers/transformer_scaffold.py) can be used.\n", "The above method of customizing the model requires rewriting the whole `nlp.layers.Transformer` layer, while sometimes you may only want to customize either the attention layer or the feedforward block. In this case, `nlp.layers.TransformerScaffold` can be used.\n"
"\n"
] ]
}, },
{ {
...@@ -452,37 +448,48 @@ ...@@ -452,37 +448,48 @@
"source": [ "source": [
"#### Customize Attention Layer\n", "#### Customize Attention Layer\n",
"\n", "\n",
"Users can also override the [attention_cls](https://github.com/tensorflow/models/blob/master/official/nlp/modeling/layers/transformer_scaffold.py#L45) argument in `TransformerScaffold`'s constructor to employ a customized Attention layer.\n", "Users can also override the `attention_cls` argument in `layers.TransformerScaffold`'s constructor to employ a customized Attention layer.\n",
"\n", "\n",
"See [TalkingHeadsAttention](https://github.com/tensorflow/models/blob/master/official/nlp/modeling/layers/talking_heads_attention.py) for how to implement a customized `Attention` layer.\n", "See [the source of `nlp.layers.TalkingHeadsAttention`](https://github.com/tensorflow/models/blob/master/official/nlp/modeling/layers/talking_heads_attention.py) for how to implement a customized `Attention` layer.\n",
"\n", "\n",
"Following is an example of using [TalkingHeadsAttention](https://github.com/tensorflow/models/blob/master/official/nlp/modeling/layers/talking_heads_attention.py):" "Following is an example of using `nlp.layers.TalkingHeadsAttention`:"
] ]
}, },
{ {
"cell_type": "code", "cell_type": "code",
"execution_count": null,
"metadata": { "metadata": {
"id": "nFrSMrZuyNeQ" "id": "nFrSMrZuyNeQ"
}, },
"outputs": [],
"source": [ "source": [
"# Use TalkingHeadsAttention\n", "# Use TalkingHeadsAttention\n",
"hidden_cfg = dict(default_hidden_cfg)\n", "hidden_cfg = dict(default_hidden_cfg)\n",
"hidden_cfg['attention_cls'] = modeling.layers.TalkingHeadsAttention\n", "hidden_cfg['attention_cls'] = nlp.layers.TalkingHeadsAttention\n",
"\n", "\n",
"kwargs = dict(default_kwargs)\n", "kwargs = dict(default_kwargs)\n",
"kwargs['hidden_cls'] = modeling.layers.TransformerScaffold\n", "kwargs['hidden_cls'] = nlp.layers.TransformerScaffold\n",
"kwargs['hidden_cfg'] = hidden_cfg\n", "kwargs['hidden_cfg'] = hidden_cfg\n",
"\n", "\n",
"encoder = modeling.networks.EncoderScaffold(**kwargs)\n", "encoder = nlp.networks.EncoderScaffold(**kwargs)\n",
"classifier_model = build_classifier(encoder)\n", "classifier_model = build_classifier(encoder)\n",
"# ... Train the model ...\n", "# ... Train the model ...\n",
"predict(classifier_model)\n", "predict(classifier_model)\n",
"\n", "\n",
"# Assert that the variable `pre_softmax_weight` from TalkingHeadsAttention exists.\n", "# Assert that the variable `pre_softmax_weight` from TalkingHeadsAttention exists.\n",
"assert 'pre_softmax_weight' in ''.join([x.name for x in classifier_model.trainable_weights])" "assert 'pre_softmax_weight' in ''.join([x.name for x in classifier_model.trainable_weights])"
], ]
},
{
"cell_type": "code",
"execution_count": null, "execution_count": null,
"outputs": [] "metadata": {
"id": "tKkZ8spzYmpc"
},
"outputs": [],
"source": [
"tf.keras.utils.plot_model(encoder_with_rezero_transformer, show_shapes=True, dpi=48)"
]
}, },
{ {
"cell_type": "markdown", "cell_type": "markdown",
...@@ -494,35 +501,35 @@ ...@@ -494,35 +501,35 @@
"\n", "\n",
"Similarly, one could also customize the feedforward layer.\n", "Similarly, one could also customize the feedforward layer.\n",
"\n", "\n",
"See [GatedFeedforward](https://github.com/tensorflow/models/blob/master/official/nlp/modeling/layers/gated_feedforward.py) for how to implement a customized feedforward layer.\n", "See [the source of `nlp.layers.GatedFeedforward`](https://github.com/tensorflow/models/blob/master/official/nlp/modeling/layers/gated_feedforward.py) for how to implement a customized feedforward layer.\n",
"\n", "\n",
"Following is an example of using [GatedFeedforward](https://github.com/tensorflow/models/blob/master/official/nlp/modeling/layers/gated_feedforward.py)." "Following is an example of using `nlp.layers.GatedFeedforward`:"
] ]
}, },
{ {
"cell_type": "code", "cell_type": "code",
"execution_count": null,
"metadata": { "metadata": {
"id": "XAbKy_l4y_-i" "id": "XAbKy_l4y_-i"
}, },
"outputs": [],
"source": [ "source": [
"# Use TalkingHeadsAttention\n", "# Use GatedFeedforward\n",
"hidden_cfg = dict(default_hidden_cfg)\n", "hidden_cfg = dict(default_hidden_cfg)\n",
"hidden_cfg['feedforward_cls'] = modeling.layers.GatedFeedforward\n", "hidden_cfg['feedforward_cls'] = nlp.layers.GatedFeedforward\n",
"\n", "\n",
"kwargs = dict(default_kwargs)\n", "kwargs = dict(default_kwargs)\n",
"kwargs['hidden_cls'] = modeling.layers.TransformerScaffold\n", "kwargs['hidden_cls'] = nlp.layers.TransformerScaffold\n",
"kwargs['hidden_cfg'] = hidden_cfg\n", "kwargs['hidden_cfg'] = hidden_cfg\n",
"\n", "\n",
"encoder_with_gated_feedforward = modeling.networks.EncoderScaffold(**kwargs)\n", "encoder_with_gated_feedforward = nlp.networks.EncoderScaffold(**kwargs)\n",
"classifier_model = build_classifier(encoder_with_gated_feedforward)\n", "classifier_model = build_classifier(encoder_with_gated_feedforward)\n",
"# ... Train the model ...\n", "# ... Train the model ...\n",
"predict(classifier_model)\n", "predict(classifier_model)\n",
"\n", "\n",
"# Assert that the variable `gate` from GatedFeedforward exists.\n", "# Assert that the variable `gate` from GatedFeedforward exists.\n",
"assert 'gate' in ''.join([x.name for x in classifier_model.trainable_weights])" "assert 'gate' in ''.join([x.name for x in classifier_model.trainable_weights])"
], ]
"execution_count": null,
"outputs": []
}, },
{ {
"cell_type": "markdown", "cell_type": "markdown",
...@@ -530,26 +537,28 @@ ...@@ -530,26 +537,28 @@
"id": "a_8NWUhkzeAq" "id": "a_8NWUhkzeAq"
}, },
"source": [ "source": [
"### Build a new Encoder using building blocks from KerasBERT.\n", "### Build a new Encoder\n",
"\n", "\n",
"Finally, you could also build a new encoder using building blocks in the modeling library.\n", "Finally, you could also build a new encoder using building blocks in the modeling library.\n",
"\n", "\n",
"See [AlbertEncoder](https://github.com/tensorflow/models/blob/master/official/nlp/modeling/networks/albert_encoder.py) as an example:\n" "See [the source for `nlp.networks.AlbertEncoder`](https://github.com/tensorflow/models/blob/master/official/nlp/modeling/networks/albert_encoder.py) as an example of how to do this.\n",
"\n",
"Here is an example using `nlp.networks.AlbertEncoder`:\n"
] ]
}, },
{ {
"cell_type": "code", "cell_type": "code",
"execution_count": null,
"metadata": { "metadata": {
"id": "xsiA3RzUzmUM" "id": "xsiA3RzUzmUM"
}, },
"outputs": [],
"source": [ "source": [
"albert_encoder = modeling.networks.AlbertEncoder(**cfg)\n", "albert_encoder = nlp.networks.AlbertEncoder(**cfg)\n",
"classifier_model = build_classifier(albert_encoder)\n", "classifier_model = build_classifier(albert_encoder)\n",
"# ... Train the model ...\n", "# ... Train the model ...\n",
"predict(classifier_model)" "predict(classifier_model)"
], ]
"execution_count": null,
"outputs": []
}, },
{ {
"cell_type": "markdown", "cell_type": "markdown",
...@@ -562,14 +571,28 @@ ...@@ -562,14 +571,28 @@
}, },
{ {
"cell_type": "code", "cell_type": "code",
"execution_count": null,
"metadata": { "metadata": {
"id": "Uv_juT22HERW" "id": "Uv_juT22HERW"
}, },
"outputs": [],
"source": [ "source": [
"tf.keras.utils.plot_model(albert_encoder, show_shapes=True, dpi=48)" "tf.keras.utils.plot_model(albert_encoder, show_shapes=True, dpi=48)"
]
}
], ],
"execution_count": null, "metadata": {
"outputs": [] "colab": {
"collapsed_sections": [],
"name": "customize_encoder.ipynb",
"provenance": [],
"toc_visible": true
},
"kernelspec": {
"display_name": "Python 3",
"name": "python3"
} }
] },
"nbformat": 4,
"nbformat_minor": 0
} }
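The notebook's customization hooks (`hidden_cls`/`hidden_cfg`, `attention_cls`, `feedforward_cls`) all rely on the same idea: the scaffold receives a layer *class* plus its constructor kwargs and instantiates it once per hidden layer, so swapping one argument swaps every block. A minimal, framework-free sketch of that injection pattern (all class and argument names below are illustrative stand-ins, not the Model Garden API):

```python
class PlainBlock:
    """Stand-in for a stock Transformer block."""

    def __init__(self, scale=1.0):
        self.scale = scale

    def __call__(self, x):
        return [v * self.scale for v in x]


class ReZeroBlock(PlainBlock):
    """Stand-in for a ReZero-style block: output = x + alpha * f(x),
    with the gate `alpha` initialized to zero (so it starts as identity)."""

    def __init__(self, scale=1.0):
        super().__init__(scale)
        self.alpha = 0.0  # would be learnable in a real model

    def __call__(self, x):
        return [v + self.alpha * (v * self.scale) for v in x]


class Scaffold:
    """Builds `num_hidden_instances` copies of `hidden_cls(**hidden_cfg)`,
    mirroring how EncoderScaffold stacks its hidden layers."""

    def __init__(self, num_hidden_instances, hidden_cls=PlainBlock, hidden_cfg=None):
        cfg = hidden_cfg or {}
        self.blocks = [hidden_cls(**cfg) for _ in range(num_hidden_instances)]

    def __call__(self, x):
        for block in self.blocks:
            x = block(x)
        return x


# Swapping the block type is a one-argument change, as in the notebook.
encoder = Scaffold(num_hidden_instances=2,
                   hidden_cls=ReZeroBlock,
                   hidden_cfg={"scale": 2.0})
out = encoder([1.0, 2.0])
```

With `alpha` initialized to zero, each `ReZeroBlock` is the identity, so `out` equals the input; that identity-at-init behavior is the point of the ReZero design.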
...@@ -95,6 +95,19 @@ ...@@ -95,6 +95,19 @@
"* `pip` will install all models and dependencies automatically." "* `pip` will install all models and dependencies automatically."
] ]
}, },
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"id": "IAOmYthAzI7J"
},
"outputs": [],
"source": [
"# Uninstall colab's opencv-python, it conflicts with `opencv-python-headless`\n",
"# which is installed by tf-models-official\n",
"!pip uninstall -y opencv-python"
]
},
{ {
"cell_type": "code", "cell_type": "code",
"execution_count": null, "execution_count": null,
...@@ -103,7 +116,7 @@ ...@@ -103,7 +116,7 @@
}, },
"outputs": [], "outputs": [],
"source": [ "source": [
"!pip install -q tf-models-official==2.4.0" "!pip install tf-models-official"
] ]
}, },
{ {
...@@ -126,8 +139,7 @@ ...@@ -126,8 +139,7 @@
"import numpy as np\n", "import numpy as np\n",
"import tensorflow as tf\n", "import tensorflow as tf\n",
"\n", "\n",
"from official.nlp import modeling\n", "from tensorflow_models import nlp"
"from official.nlp.modeling import layers, losses, models, networks"
] ]
}, },
{ {
...@@ -151,9 +163,9 @@ ...@@ -151,9 +163,9 @@
"source": [ "source": [
"### Build a `BertPretrainer` model wrapping `BertEncoder`\n", "### Build a `BertPretrainer` model wrapping `BertEncoder`\n",
"\n", "\n",
"The [BertEncoder](https://github.com/tensorflow/models/blob/master/official/nlp/modeling/networks/bert_encoder.py) implements the Transformer-based encoder as described in the [BERT paper](https://arxiv.org/abs/1810.04805). It includes the embedding lookups and transformer layers, but not the masked language model or classification task networks.\n", "The `nlp.networks.BertEncoder` class implements the Transformer-based encoder as described in the [BERT paper](https://arxiv.org/abs/1810.04805). It includes the embedding lookups and transformer layers (`nlp.layers.TransformerEncoderBlock`), but not the masked language model or classification task networks.\n",
"\n", "\n",
"The [BertPretrainer](https://github.com/tensorflow/models/blob/master/official/nlp/modeling/models/bert_pretrainer.py) allows a user to pass in a transformer stack, and instantiates the masked language model and classification networks that are used to create the training objectives." "The `nlp.models.BertPretrainer` class allows a user to pass in a transformer stack, and instantiates the masked language model and classification networks that are used to create the training objectives."
] ]
}, },
{ {
...@@ -166,9 +178,10 @@ ...@@ -166,9 +178,10 @@
"source": [ "source": [
"# Build a small transformer network.\n", "# Build a small transformer network.\n",
"vocab_size = 100\n", "vocab_size = 100\n",
"sequence_length = 16\n", "network = nlp.networks.BertEncoder(\n",
"network = modeling.networks.BertEncoder(\n", " vocab_size=vocab_size, \n",
" vocab_size=vocab_size, num_layers=2, sequence_length=16)" " # The number of TransformerEncoderBlock layers\n",
" num_layers=3)"
] ]
}, },
{ {
...@@ -177,7 +190,7 @@ ...@@ -177,7 +190,7 @@
"id": "0NH5irV5KTMS" "id": "0NH5irV5KTMS"
}, },
"source": [ "source": [
"Inspecting the encoder, we see that it contains a few embedding layers and stacked `Transformer` layers, connected to three input layers:\n", "Inspecting the encoder, we see that it contains a few embedding layers and stacked `nlp.layers.TransformerEncoderBlock` layers, connected to three input layers:\n",
"\n", "\n",
"`input_word_ids`, `input_type_ids` and `input_mask`.\n" "`input_word_ids`, `input_type_ids` and `input_mask`.\n"
] ]
...@@ -190,7 +203,7 @@ ...@@ -190,7 +203,7 @@
}, },
"outputs": [], "outputs": [],
"source": [ "source": [
"tf.keras.utils.plot_model(network, show_shapes=True, dpi=48)" "tf.keras.utils.plot_model(network, show_shapes=True, expand_nested=True, dpi=48)"
] ]
}, },
{ {
...@@ -203,7 +216,7 @@ ...@@ -203,7 +216,7 @@
"source": [ "source": [
"# Create a BERT pretrainer with the created network.\n", "# Create a BERT pretrainer with the created network.\n",
"num_token_predictions = 8\n", "num_token_predictions = 8\n",
"bert_pretrainer = modeling.models.BertPretrainer(\n", "bert_pretrainer = nlp.models.BertPretrainer(\n",
" network, num_classes=2, num_token_predictions=num_token_predictions, output='predictions')" " network, num_classes=2, num_token_predictions=num_token_predictions, output='predictions')"
] ]
}, },
...@@ -213,7 +226,7 @@ ...@@ -213,7 +226,7 @@
"id": "d5h5HT7gNHx_" "id": "d5h5HT7gNHx_"
}, },
"source": [ "source": [
"Inspecting the `bert_pretrainer`, we see it wraps the `encoder` with additional `MaskedLM` and `Classification` heads." "Inspecting the `bert_pretrainer`, we see it wraps the `encoder` with additional `MaskedLM` and `nlp.layers.ClassificationHead` heads."
] ]
}, },
{ {
...@@ -224,7 +237,7 @@ ...@@ -224,7 +237,7 @@
}, },
"outputs": [], "outputs": [],
"source": [ "source": [
"tf.keras.utils.plot_model(bert_pretrainer, show_shapes=True, dpi=48)" "tf.keras.utils.plot_model(bert_pretrainer, show_shapes=True, expand_nested=True, dpi=48)"
] ]
}, },
{ {
...@@ -236,7 +249,9 @@ ...@@ -236,7 +249,9 @@
"outputs": [], "outputs": [],
"source": [ "source": [
"# We can feed some dummy data to get masked language model and sentence output.\n", "# We can feed some dummy data to get masked language model and sentence output.\n",
"sequence_length = 16\n",
"batch_size = 2\n", "batch_size = 2\n",
"\n",
"word_id_data = np.random.randint(vocab_size, size=(batch_size, sequence_length))\n", "word_id_data = np.random.randint(vocab_size, size=(batch_size, sequence_length))\n",
"mask_data = np.random.randint(2, size=(batch_size, sequence_length))\n", "mask_data = np.random.randint(2, size=(batch_size, sequence_length))\n",
"type_id_data = np.random.randint(2, size=(batch_size, sequence_length))\n", "type_id_data = np.random.randint(2, size=(batch_size, sequence_length))\n",
...@@ -246,8 +261,8 @@ ...@@ -246,8 +261,8 @@
" [word_id_data, mask_data, type_id_data, masked_lm_positions_data])\n", " [word_id_data, mask_data, type_id_data, masked_lm_positions_data])\n",
"lm_output = outputs[\"masked_lm\"]\n", "lm_output = outputs[\"masked_lm\"]\n",
"sentence_output = outputs[\"classification\"]\n", "sentence_output = outputs[\"classification\"]\n",
"print(lm_output)\n", "print(f'lm_output: shape={lm_output.shape}, dtype={lm_output.dtype!r}')\n",
"print(sentence_output)" "print(f'sentence_output: shape={sentence_output.shape}, dtype={sentence_output.dtype!r}')"
] ]
}, },
{ {
...@@ -272,14 +287,15 @@ ...@@ -272,14 +287,15 @@
"masked_lm_weights_data = np.random.randint(2, size=(batch_size, num_token_predictions))\n", "masked_lm_weights_data = np.random.randint(2, size=(batch_size, num_token_predictions))\n",
"next_sentence_labels_data = np.random.randint(2, size=(batch_size))\n", "next_sentence_labels_data = np.random.randint(2, size=(batch_size))\n",
"\n", "\n",
"mlm_loss = modeling.losses.weighted_sparse_categorical_crossentropy_loss(\n", "mlm_loss = nlp.losses.weighted_sparse_categorical_crossentropy_loss(\n",
" labels=masked_lm_ids_data,\n", " labels=masked_lm_ids_data,\n",
" predictions=lm_output,\n", " predictions=lm_output,\n",
" weights=masked_lm_weights_data)\n", " weights=masked_lm_weights_data)\n",
"sentence_loss = modeling.losses.weighted_sparse_categorical_crossentropy_loss(\n", "sentence_loss = nlp.losses.weighted_sparse_categorical_crossentropy_loss(\n",
" labels=next_sentence_labels_data,\n", " labels=next_sentence_labels_data,\n",
" predictions=sentence_output)\n", " predictions=sentence_output)\n",
"loss = mlm_loss + sentence_loss\n", "loss = mlm_loss + sentence_loss\n",
"\n",
"print(loss)" "print(loss)"
] ]
}, },
...@@ -290,8 +306,7 @@ ...@@ -290,8 +306,7 @@
}, },
"source": [ "source": [
"With the loss, you can optimize the model.\n", "With the loss, you can optimize the model.\n",
"After training, we can save the weights of TransformerEncoder for the downstream fine-tuning tasks. Please see [run_pretraining.py](https://github.com/tensorflow/models/blob/master/official/nlp/bert/run_pretraining.py) for the full example.\n", "After training, we can save the weights of TransformerEncoder for the downstream fine-tuning tasks. Please see [run_pretraining.py](https://github.com/tensorflow/models/blob/master/official/nlp/bert/run_pretraining.py) for the full example.\n"
"\n"
] ]
}, },
{ {
...@@ -315,9 +330,9 @@ ...@@ -315,9 +330,9 @@
"source": [ "source": [
"### Build a BertSpanLabeler wrapping BertEncoder\n", "### Build a BertSpanLabeler wrapping BertEncoder\n",
"\n", "\n",
"[BertSpanLabeler](https://github.com/tensorflow/models/blob/master/official/nlp/modeling/models/bert_span_labeler.py) implements a simple single-span start-end predictor (that is, a model that predicts two values: a start token index and an end token index), suitable for SQuAD-style tasks.\n", "The `nlp.models.BertSpanLabeler` class implements a simple single-span start-end predictor (that is, a model that predicts two values: a start token index and an end token index), suitable for SQuAD-style tasks.\n",
"\n", "\n",
"Note that `BertSpanLabeler` wraps a `BertEncoder`, the weights of which can be restored from the above pretraining model.\n" "Note that `nlp.models.BertSpanLabeler` wraps a `nlp.networks.BertEncoder`, the weights of which can be restored from the above pretraining model.\n"
] ]
}, },
{ {
...@@ -328,11 +343,11 @@ ...@@ -328,11 +343,11 @@
}, },
"outputs": [], "outputs": [],
"source": [ "source": [
"network = modeling.networks.BertEncoder(\n", "network = nlp.networks.BertEncoder(\n",
" vocab_size=vocab_size, num_layers=2, sequence_length=sequence_length)\n", " vocab_size=vocab_size, num_layers=2)\n",
"\n", "\n",
"# Create a BERT trainer with the created network.\n", "# Create a BERT trainer with the created network.\n",
"bert_span_labeler = modeling.models.BertSpanLabeler(network)" "bert_span_labeler = nlp.models.BertSpanLabeler(network)"
] ]
}, },
{ {
...@@ -352,7 +367,7 @@ ...@@ -352,7 +367,7 @@
}, },
"outputs": [], "outputs": [],
"source": [ "source": [
"tf.keras.utils.plot_model(bert_span_labeler, show_shapes=True, dpi=48)" "tf.keras.utils.plot_model(bert_span_labeler, show_shapes=True, expand_nested=True, dpi=48)"
] ]
}, },
{ {
...@@ -370,8 +385,9 @@ ...@@ -370,8 +385,9 @@
"\n", "\n",
"# Feed the data to the model.\n", "# Feed the data to the model.\n",
"start_logits, end_logits = bert_span_labeler([word_id_data, mask_data, type_id_data])\n", "start_logits, end_logits = bert_span_labeler([word_id_data, mask_data, type_id_data])\n",
"print(start_logits)\n", "\n",
"print(end_logits)" "print(f'start_logits: shape={start_logits.shape}, dtype={start_logits.dtype!r}')\n",
"print(f'end_logits: shape={end_logits.shape}, dtype={end_logits.dtype!r}')"
] ]
}, },
{ {
...@@ -432,7 +448,7 @@ ...@@ -432,7 +448,7 @@
"source": [ "source": [
"### Build a BertClassifier model wrapping BertEncoder\n", "### Build a BertClassifier model wrapping BertEncoder\n",
"\n", "\n",
"[BertClassifier](https://github.com/tensorflow/models/blob/master/official/nlp/modeling/models/bert_classifier.py) implements a [CLS] token classification model containing a single classification head." "`nlp.models.BertClassifier` implements a [CLS] token classification model containing a single classification head."
] ]
}, },
{ {
...@@ -443,12 +459,12 @@ ...@@ -443,12 +459,12 @@
}, },
"outputs": [], "outputs": [],
"source": [ "source": [
"network = modeling.networks.BertEncoder(\n", "network = nlp.networks.BertEncoder(\n",
" vocab_size=vocab_size, num_layers=2, sequence_length=sequence_length)\n", " vocab_size=vocab_size, num_layers=2)\n",
"\n", "\n",
"# Create a BERT trainer with the created network.\n", "# Create a BERT trainer with the created network.\n",
"num_classes = 2\n", "num_classes = 2\n",
"bert_classifier = modeling.models.BertClassifier(\n", "bert_classifier = nlp.models.BertClassifier(\n",
" network, num_classes=num_classes)" " network, num_classes=num_classes)"
] ]
}, },
...@@ -469,7 +485,7 @@ ...@@ -469,7 +485,7 @@
}, },
"outputs": [], "outputs": [],
"source": [ "source": [
"tf.keras.utils.plot_model(bert_classifier, show_shapes=True, dpi=48)" "tf.keras.utils.plot_model(bert_classifier, show_shapes=True, expand_nested=True, dpi=48)"
] ]
}, },
{ {
...@@ -487,7 +503,7 @@ ...@@ -487,7 +503,7 @@
"\n", "\n",
"# Feed the data to the model.\n", "# Feed the data to the model.\n",
"logits = bert_classifier([word_id_data, mask_data, type_id_data])\n", "logits = bert_classifier([word_id_data, mask_data, type_id_data])\n",
"print(logits)" "print(f'logits: shape={logits.shape}, dtype={logits.dtype!r}')"
] ]
}, },
{ {
...@@ -529,8 +545,7 @@ ...@@ -529,8 +545,7 @@
"metadata": { "metadata": {
"colab": { "colab": {
"collapsed_sections": [], "collapsed_sections": [],
"name": "Introduction to the TensorFlow Models NLP library", "name": "nlp_modeling_library_intro.ipynb",
"private_outputs": true,
"provenance": [], "provenance": [],
"toc_visible": true "toc_visible": true
}, },
......
...@@ -12,3 +12,15 @@ ...@@ -12,3 +12,15 @@
# See the License for the specific language governing permissions and # See the License for the specific language governing permissions and
# limitations under the License. # limitations under the License.
"""Core is shared by both `nlp` and `vision`."""
from official.core import actions
from official.core import base_task
from official.core import base_trainer
from official.core import config_definitions
from official.core import exp_factory
from official.core import export_base
from official.core import input_reader
from official.core import registry
from official.core import task_factory
from official.core import train_lib
from official.core import train_utils
...@@ -33,57 +33,6 @@ ExperimentConfig = config_definitions.ExperimentConfig ...@@ -33,57 +33,6 @@ ExperimentConfig = config_definitions.ExperimentConfig
TrainerConfig = config_definitions.TrainerConfig TrainerConfig = config_definitions.TrainerConfig
class Recovery:
"""Built-in model blowup recovery module.
Checks the loss value against the given threshold. If it is exceeded (or
the loss is NaN), recovers the model by restoring the checkpoint on disk.
"""
def __init__(self,
loss_upper_bound: float,
checkpoint_manager: tf.train.CheckpointManager,
recovery_begin_steps: int = 0,
recovery_max_trials: int = 3):
self.recover_counter = 0
self.recovery_begin_steps = recovery_begin_steps
self.recovery_max_trials = recovery_max_trials
self.loss_upper_bound = loss_upper_bound
self.checkpoint_manager = checkpoint_manager
def should_recover(self, loss_value, global_step):
if tf.math.is_nan(loss_value):
return True
if (global_step >= self.recovery_begin_steps and
loss_value > self.loss_upper_bound):
return True
return False
def maybe_recover(self, loss_value, global_step):
"""Conditionally recovers the training by triggering checkpoint restoration.
Args:
loss_value: the loss value as a float.
global_step: the number of global training steps.
Raises:
RuntimeError: when recovery happens more than the max number of trials,
the job should crash.
"""
if not self.should_recover(loss_value, global_step):
return
self.recover_counter += 1
if self.recover_counter > self.recovery_max_trials:
raise RuntimeError(
"The loss value is NaN or out of range after training loop and "
f"this happens {self.recover_counter} times.")
# Loads the previous good checkpoint.
checkpoint_path = self.checkpoint_manager.restore_or_initialize()
logging.warning(
"Recovering the model from checkpoint: %s. The loss value becomes "
"%f at step %d.", checkpoint_path, loss_value, global_step)
class _AsyncTrainer(orbit.StandardTrainer, orbit.StandardEvaluator): class _AsyncTrainer(orbit.StandardTrainer, orbit.StandardEvaluator):
"""Trainer class for both sync and async Strategy.""" """Trainer class for both sync and async Strategy."""
......
...@@ -150,30 +150,6 @@ class MockAsyncTrainer(trainer_lib._AsyncTrainer): ...@@ -150,30 +150,6 @@ class MockAsyncTrainer(trainer_lib._AsyncTrainer):
return self.eval_global_step.numpy() return self.eval_global_step.numpy()
class RecoveryTest(tf.test.TestCase):
def test_recovery_module(self):
ckpt = tf.train.Checkpoint(v=tf.Variable(1, dtype=tf.int32))
model_dir = self.get_temp_dir()
manager = tf.train.CheckpointManager(ckpt, model_dir, max_to_keep=1)
recovery_module = trainer_lib.Recovery(
loss_upper_bound=1.0,
checkpoint_manager=manager,
recovery_begin_steps=1,
recovery_max_trials=1)
self.assertFalse(recovery_module.should_recover(1.1, 0))
self.assertFalse(recovery_module.should_recover(0.1, 1))
self.assertTrue(recovery_module.should_recover(1.1, 2))
# First triggers the recovery once.
recovery_module.maybe_recover(1.1, 10)
# Second time, it raises.
with self.assertRaisesRegex(
RuntimeError, 'The loss value is NaN .*'):
recovery_module.maybe_recover(1.1, 10)
class TrainerTest(tf.test.TestCase, parameterized.TestCase):
def setUp(self):
...
...@@ -76,6 +76,10 @@ class DataConfig(base_config.Config):
features. The main use case is to skip the image/video decoding for better
performance.
seed: An optional seed to use for deterministic shuffling/preprocessing.
prefetch_buffer_size: An int specifying the buffer size of prefetch
datasets. If None, the buffer size is autotuned. Specifying this is useful
in case autotuning uses up too much memory by making the buffer size too
high.
""" """
input_path: Union[Sequence[str], str, base_config.Config] = "" input_path: Union[Sequence[str], str, base_config.Config] = ""
tfds_name: str = "" tfds_name: str = ""
...@@ -96,6 +100,7 @@ class DataConfig(base_config.Config): ...@@ -96,6 +100,7 @@ class DataConfig(base_config.Config):
tfds_as_supervised: bool = False tfds_as_supervised: bool = False
tfds_skip_decoding_feature: str = "" tfds_skip_decoding_feature: str = ""
seed: Optional[int] = None seed: Optional[int] = None
prefetch_buffer_size: Optional[int] = None
@dataclasses.dataclass
...@@ -190,8 +195,8 @@ class TrainerConfig(base_config.Config):
is only used in continuous_train_and_eval and continuous_eval modes.
Default value is 1 hour.
train_steps: number of train steps.
validation_steps: number of eval steps. If -1, the entire eval dataset is
used.
validation_interval: number of training steps to run between evaluations.
best_checkpoint_export_subdir: if set, the trainer will keep track of the
best evaluation metric, and export the corresponding best checkpoint under
...
...@@ -292,6 +292,8 @@ class InputReader:
self._transform_and_batch_fn = transform_and_batch_fn
self._postprocess_fn = postprocess_fn
self._seed = params.seed
self._prefetch_buffer_size = (params.prefetch_buffer_size or
tf.data.experimental.AUTOTUNE)
# When tf.data service is enabled, each data service worker should get
# different random seeds. Thus, we set `seed` to None.
...@@ -505,4 +507,4 @@ class InputReader:
options = tf.data.Options()
options.experimental_deterministic = self._deterministic
dataset = dataset.with_options(options)
return dataset.prefetch(self._prefetch_buffer_size)
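The `params.prefetch_buffer_size or tf.data.experimental.AUTOTUNE` fallback above has one subtlety worth noting: Python's `or` also maps an explicit `0` to autotuning. A minimal sketch of the resolution logic, using `-1` as a stand-in constant (that is the actual value of `tf.data.experimental.AUTOTUNE`):

```python
AUTOTUNE = -1  # stand-in; tf.data.experimental.AUTOTUNE is the constant -1

def resolve_prefetch_buffer_size(configured):
    """Mirror the InputReader fallback: `configured or AUTOTUNE`."""
    return configured or AUTOTUNE

print(resolve_prefetch_buffer_size(None))  # -1, i.e. autotune
print(resolve_prefetch_buffer_size(16))    # 16
print(resolve_prefetch_buffer_size(0))     # -1: `or` treats 0 as falsy too
```

So a caller who wants a literal buffer size of zero cannot express it through this config; any positive int is passed through unchanged.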
# Copyright 2022 The TensorFlow Authors. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
"""Adam optimizer with weight decay that exactly matches the original BERT."""
import re
from absl import logging
import tensorflow as tf
class AdamWeightDecay(tf.keras.optimizers.Adam):
"""Adam enables L2 weight decay and clip_by_global_norm on gradients.
[Warning!]: The Keras optimizer supports gradient clipping and has an AdamW
implementation. Please consider evaluating that option in the Keras package.
Just adding the square of the weights to the loss function is *not* the
correct way of using L2 regularization/weight decay with Adam, since that will
interact with the m and v parameters in strange ways.
Instead we want to decay the weights in a manner that doesn't interact with
the m/v parameters. This is equivalent to adding the square of the weights to
the loss with plain (non-momentum) SGD.
"""
def __init__(self,
learning_rate=0.001,
beta_1=0.9,
beta_2=0.999,
epsilon=1e-7,
amsgrad=False,
weight_decay_rate=0.0,
include_in_weight_decay=None,
exclude_from_weight_decay=None,
gradient_clip_norm=1.0,
name='AdamWeightDecay',
**kwargs):
super(AdamWeightDecay, self).__init__(learning_rate, beta_1, beta_2,
epsilon, amsgrad, name, **kwargs)
self.weight_decay_rate = weight_decay_rate
self.gradient_clip_norm = gradient_clip_norm
self._include_in_weight_decay = include_in_weight_decay
self._exclude_from_weight_decay = exclude_from_weight_decay
logging.info('AdamWeightDecay gradient_clip_norm=%f', gradient_clip_norm)
def _prepare_local(self, var_device, var_dtype, apply_state):
super(AdamWeightDecay, self)._prepare_local(var_device, var_dtype, # pytype: disable=attribute-error # typed-keras
apply_state)
apply_state[(var_device, var_dtype)]['weight_decay_rate'] = tf.constant(
self.weight_decay_rate, name='adam_weight_decay_rate')
def _decay_weights_op(self, var, learning_rate, apply_state):
do_decay = self._do_use_weight_decay(var.name)
if do_decay:
return var.assign_sub(
learning_rate * var *
apply_state[(var.device, var.dtype.base_dtype)]['weight_decay_rate'],
use_locking=self._use_locking)
return tf.no_op()
def apply_gradients(self,
grads_and_vars,
name=None,
experimental_aggregate_gradients=True):
grads, tvars = list(zip(*grads_and_vars))
if experimental_aggregate_gradients and self.gradient_clip_norm > 0.0:
# When experimental_aggregate_gradients = False, apply_gradients() no
# longer implicitly allreduces gradients; users manually allreduce
# gradients and pass the allreduced grads_and_vars. For now,
# clip_by_global_norm is moved to before the explicit allreduce to
# keep the math the same as TF 1 and pre-TF 2.2 implementations.
(grads, _) = tf.clip_by_global_norm(
grads, clip_norm=self.gradient_clip_norm)
return super(AdamWeightDecay, self).apply_gradients(
zip(grads, tvars),
name=name,
experimental_aggregate_gradients=experimental_aggregate_gradients)
def _get_lr(self, var_device, var_dtype, apply_state):
"""Retrieves the learning rate with the given state."""
if apply_state is None:
return self._decayed_lr_t[var_dtype], {}
apply_state = apply_state or {}
coefficients = apply_state.get((var_device, var_dtype))
if coefficients is None:
coefficients = self._fallback_apply_state(var_device, var_dtype)
apply_state[(var_device, var_dtype)] = coefficients
return coefficients['lr_t'], dict(apply_state=apply_state)
def _resource_apply_dense(self, grad, var, apply_state=None):
lr_t, kwargs = self._get_lr(var.device, var.dtype.base_dtype, apply_state)
decay = self._decay_weights_op(var, lr_t, apply_state)
with tf.control_dependencies([decay]):
return super(AdamWeightDecay,
self)._resource_apply_dense(grad, var, **kwargs) # pytype: disable=attribute-error # typed-keras
def _resource_apply_sparse(self, grad, var, indices, apply_state=None):
lr_t, kwargs = self._get_lr(var.device, var.dtype.base_dtype, apply_state)
decay = self._decay_weights_op(var, lr_t, apply_state)
with tf.control_dependencies([decay]):
return super(AdamWeightDecay,
self)._resource_apply_sparse(grad, var, indices, **kwargs) # pytype: disable=attribute-error # typed-keras
def get_config(self):
config = super(AdamWeightDecay, self).get_config()
config.update({
'weight_decay_rate': self.weight_decay_rate,
})
return config
def _do_use_weight_decay(self, param_name):
"""Whether to use L2 weight decay for `param_name`."""
if self.weight_decay_rate == 0:
return False
if self._include_in_weight_decay:
for r in self._include_in_weight_decay:
if re.search(r, param_name) is not None:
return True
if self._exclude_from_weight_decay:
for r in self._exclude_from_weight_decay:
if re.search(r, param_name) is not None:
return False
return True
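The include/exclude precedence in `_do_use_weight_decay` above (an include match wins over an exclude match, and exclusion only applies otherwise) can be isolated as a small, runnable sketch. `should_decay` and its parameter names are hypothetical stand-ins for illustration, not part of the optimizer's API:

```python
import re

def should_decay(param_name, include=None, exclude=None,
                 weight_decay_rate=0.01):
    """Mirror of AdamWeightDecay._do_use_weight_decay selection logic."""
    if weight_decay_rate == 0:
        return False
    # An include pattern match wins outright.
    if include:
        for r in include:
            if re.search(r, param_name) is not None:
                return True
    # Otherwise an exclude pattern match disables decay.
    if exclude:
        for r in exclude:
            if re.search(r, param_name) is not None:
                return False
    return True

# BERT-style setup: never decay LayerNorm or bias variables.
exclude = ["LayerNorm", "layer_norm", "bias"]
print(should_decay("encoder/layer_0/kernel", exclude=exclude))  # True
print(should_decay("encoder/layer_0/bias", exclude=exclude))    # False
```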
...@@ -18,20 +18,21 @@ from typing import Callable, Optional, Union, List, Tuple
import gin
import tensorflow as tf
import tensorflow_addons.optimizers as tfa_optimizers
from official.modeling.optimization import slide_optimizer
from official.modeling.optimization import adafactor_optimizer
from official.modeling.optimization import ema_optimizer
from official.modeling.optimization import lars_optimizer
from official.modeling.optimization import legacy_adamw
from official.modeling.optimization import lr_schedule
from official.modeling.optimization.configs import optimization_config as opt_cfg
from official.nlp import optimization as nlp_optimization
OPTIMIZERS_CLS = {
'sgd': tf.keras.optimizers.SGD,
# TODO(chenmoneygithub): experimental.SGD
'adam': tf.keras.optimizers.Adam,
# TODO(chenmoneygithub): experimental.Adam
'adamw': legacy_adamw.AdamWeightDecay,
'lamb': tfa_optimizers.LAMB,
'rmsprop': tf.keras.optimizers.RMSprop,
'lars': lars_optimizer.LARS,
...@@ -57,8 +58,8 @@ WARMUP_CLS = {
}
def register_optimizer_cls(key: str,
optimizer_config_cls: tf.keras.optimizers.Optimizer):
"""Registers a customized optimizer class.
The user will still need to subclass data classes in
...@@ -85,6 +86,8 @@ class OptimizerFactory:
(4) Build optimizer.
This is a typical example for using this class:
```
params = {
'optimizer': {
'type': 'sgd',
...@@ -104,6 +107,7 @@ class OptimizerFactory:
opt_factory = OptimizerFactory(opt_config)
lr = opt_factory.build_learning_rate()
optimizer = opt_factory.build_optimizer(lr)
```
""" """
def __init__(self, config: opt_cfg.OptimizationConfig):
...@@ -156,9 +160,12 @@ class OptimizerFactory:
def build_optimizer(
self,
lr: Union[tf.keras.optimizers.schedules.LearningRateSchedule, float],
gradient_aggregator: Optional[Callable[
[List[Tuple[tf.Tensor, tf.Tensor]]], List[Tuple[tf.Tensor,
tf.Tensor]]]] = None,
gradient_transformers: Optional[List[Callable[
[List[Tuple[tf.Tensor, tf.Tensor]]], List[Tuple[tf.Tensor,
tf.Tensor]]]]] = None,
postprocessor: Optional[Callable[[tf.keras.optimizers.Optimizer],
tf.keras.optimizers.Optimizer]] = None):
"""Build optimizer.
...@@ -170,6 +177,7 @@ class OptimizerFactory:
Args:
lr: A floating point value, or a
tf.keras.optimizers.schedules.LearningRateSchedule instance.
gradient_aggregator: Optional function to overwrite gradient aggregation.
gradient_transformers: Optional list of functions to use to transform
gradients before applying updates to Variables. The functions are
applied after gradient_aggregator. The functions should accept and
...@@ -193,6 +201,8 @@ class OptimizerFactory:
del optimizer_dict['global_clipnorm']
optimizer_dict['learning_rate'] = lr
if gradient_aggregator is not None:
optimizer_dict['gradient_aggregator'] = gradient_aggregator
if gradient_transformers is not None:
optimizer_dict['gradient_transformers'] = gradient_transformers
...
...@@ -49,6 +49,39 @@ class OptimizerFactoryTest(tf.test.TestCase, parameterized.TestCase):
self.assertIsInstance(optimizer, optimizer_cls)
self.assertEqual(expected_optimizer_config, optimizer.get_config())
def test_gradient_aggregator(self):
params = {
'optimizer': {
'type': 'adam',
},
'learning_rate': {
'type': 'constant',
'constant': {
'learning_rate': 1.0
}
}
}
opt_config = optimization_config.OptimizationConfig(params)
opt_factory = optimizer_factory.OptimizerFactory(opt_config)
lr = opt_factory.build_learning_rate()
# Dummy function to zero out gradients.
zero_grads = lambda gv: [(tf.zeros_like(g), v) for g, v in gv]
optimizer = opt_factory.build_optimizer(lr, gradient_aggregator=zero_grads)
var0 = tf.Variable([1.0, 2.0])
var1 = tf.Variable([3.0, 4.0])
grads0 = tf.constant([1.0, 1.0])
grads1 = tf.constant([1.0, 1.0])
grads_and_vars = list(zip([grads0, grads1], [var0, var1]))
optimizer.apply_gradients(grads_and_vars)
self.assertAllClose(np.array([1.0, 2.0]), var0.numpy())
self.assertAllClose(np.array([3.0, 4.0]), var1.numpy())
@parameterized.parameters((None, None), (1.0, None), (None, 1.0))
def test_gradient_clipping(self, clipnorm, clipvalue):
params = {
...@@ -418,7 +451,7 @@ class OptimizerFactoryTest(tf.test.TestCase, parameterized.TestCase):
}
}
}
expected_lr_step_values = [[0, 0.0], [5000, 1e-4 / 2.0], [10000, 1e-4],
[20000, 9.994863e-05], [499999, 5e-05]]
opt_config = optimization_config.OptimizationConfig(params)
opt_factory = optimizer_factory.OptimizerFactory(opt_config)
...@@ -434,10 +467,12 @@ class OptimizerFactoryRegistryTest(tf.test.TestCase):
class MyClass():
pass
optimizer_factory.register_optimizer_cls('test', MyClass)
self.assertIn('test', optimizer_factory.OPTIMIZERS_CLS)
with self.assertRaisesRegex(ValueError, 'test already registered.*'):
optimizer_factory.register_optimizer_cls('test', MyClass)
if __name__ == '__main__':
tf.test.main()
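The duplicate-registration guard exercised by `OptimizerFactoryRegistryTest` follows a common registry pattern. The sketch below is a hedged reconstruction with a stubbed `OPTIMIZERS_CLS`, not the factory's exact code:

```python
OPTIMIZERS_CLS = {"sgd": object}  # stub registry keyed by optimizer name

def register_optimizer_cls(key, optimizer_cls):
    """Register a custom optimizer class; re-registering a key raises."""
    if key in OPTIMIZERS_CLS:
        raise ValueError(f"{key} already registered in OPTIMIZERS_CLS.")
    OPTIMIZERS_CLS[key] = optimizer_cls
```

Failing loudly on a duplicate key surfaces accidental name collisions at registration time instead of silently shadowing an existing optimizer.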
# TF-NLP Model Garden
⚠️ Disclaimer: the datasets hyperlinked from this page are not owned or
distributed by Google. They are made available by third parties.
Please review the terms and conditions made available by those third parties
before using the data.
This codebase provides a Natural Language Processing modeling toolkit written in
[TF2](https://www.tensorflow.org/guide/effective_tf2). It allows researchers and
...@@ -30,7 +35,10 @@ research ideas. Detailed instructions can be found in READMEs in each folder.
We provide SoTA model implementations, pre-trained models, training and
evaluation examples, and command lines. Detailed instructions can be found in
the READMEs for specific papers. Below are some papers implemented in the
repository; more NLP projects can be found in the
[`projects`](https://github.com/tensorflow/models/tree/master/official/projects)
folder:
1. [BERT](MODEL_GARDEN.md#available-model-configs): [BERT: Pre-training of Deep Bidirectional Transformers for
Language Understanding](https://arxiv.org/abs/1810.04805) by Devlin et al.,
...@@ -38,10 +46,10 @@ READMEs for specific papers.
2. [ALBERT](MODEL_GARDEN.md#available-model-configs):
[A Lite BERT for Self-supervised Learning of Language Representations](https://arxiv.org/abs/1909.11942)
by Lan et al., 2019
3. [XLNet](MODEL_GARDEN.md):
[XLNet: Generalized Autoregressive Pretraining for Language Understanding](https://arxiv.org/abs/1906.08237)
by Yang et al., 2019
4. [Transformer for translation](MODEL_GARDEN.md#available-model-configs):
[Attention Is All You Need](https://arxiv.org/abs/1706.03762) by Vaswani et
al., 2017
...
...@@ -17,4 +17,3 @@
from official.nlp.configs import finetuning_experiments
from official.nlp.configs import pretraining_experiments
from official.nlp.configs import wmt_transformer_experiments
from official.projects.teams import teams_experiments
...@@ -187,6 +187,8 @@ class AxProcessor(DataProcessor):
def _create_examples_tfds(self, dataset, set_type):
"""Creates examples for the training/dev/test sets."""
dataset = list(dataset)
dataset.sort(key=lambda x: x["idx"])
examples = []
for i, example in enumerate(dataset):
guid = "%s-%s" % (set_type, i)
...@@ -218,6 +220,8 @@ class ColaProcessor(DefaultGLUEDataProcessor):
"""Creates examples for the training/dev/test sets."""
dataset = tfds.load(
"glue/cola", split=set_type, try_gcs=True).as_numpy_iterator()
dataset = list(dataset)
dataset.sort(key=lambda x: x["idx"])
examples = []
for i, example in enumerate(dataset):
guid = "%s-%s" % (set_type, i)
...@@ -312,6 +316,8 @@ class MnliProcessor(DataProcessor):
"""Creates examples for the training/dev/test sets."""
dataset = tfds.load(
"glue/mnli", split=set_type, try_gcs=True).as_numpy_iterator()
dataset = list(dataset)
dataset.sort(key=lambda x: x["idx"])
examples = []
for i, example in enumerate(dataset):
guid = "%s-%s" % (set_type, i)
...@@ -343,6 +349,8 @@ class MrpcProcessor(DefaultGLUEDataProcessor):
"""Creates examples for the training/dev/test sets."""
dataset = tfds.load(
"glue/mrpc", split=set_type, try_gcs=True).as_numpy_iterator()
dataset = list(dataset)
dataset.sort(key=lambda x: x["idx"])
examples = []
for i, example in enumerate(dataset):
guid = "%s-%s" % (set_type, i)
...@@ -453,6 +461,8 @@ class QnliProcessor(DefaultGLUEDataProcessor):
"""Creates examples for the training/dev/test sets."""
dataset = tfds.load(
"glue/qnli", split=set_type, try_gcs=True).as_numpy_iterator()
dataset = list(dataset)
dataset.sort(key=lambda x: x["idx"])
examples = []
for i, example in enumerate(dataset):
guid = "%s-%s" % (set_type, i)
...@@ -484,6 +494,8 @@ class QqpProcessor(DefaultGLUEDataProcessor):
"""Creates examples for the training/dev/test sets."""
dataset = tfds.load(
"glue/qqp", split=set_type, try_gcs=True).as_numpy_iterator()
dataset = list(dataset)
dataset.sort(key=lambda x: x["idx"])
examples = []
for i, example in enumerate(dataset):
guid = "%s-%s" % (set_type, i)
...@@ -517,6 +529,8 @@ class RteProcessor(DefaultGLUEDataProcessor):
"""Creates examples for the training/dev/test sets."""
dataset = tfds.load(
"glue/rte", split=set_type, try_gcs=True).as_numpy_iterator()
dataset = list(dataset)
dataset.sort(key=lambda x: x["idx"])
examples = []
for i, example in enumerate(dataset):
guid = "%s-%s" % (set_type, i)
...@@ -548,6 +562,8 @@ class SstProcessor(DefaultGLUEDataProcessor):
"""Creates examples for the training/dev/test sets."""
dataset = tfds.load(
"glue/sst2", split=set_type, try_gcs=True).as_numpy_iterator()
dataset = list(dataset)
dataset.sort(key=lambda x: x["idx"])
examples = []
for i, example in enumerate(dataset):
guid = "%s-%s" % (set_type, i)
...@@ -574,6 +590,8 @@ class StsBProcessor(DefaultGLUEDataProcessor):
"""Creates examples for the training/dev/test sets."""
dataset = tfds.load(
"glue/stsb", split=set_type, try_gcs=True).as_numpy_iterator()
dataset = list(dataset)
dataset.sort(key=lambda x: x["idx"])
examples = []
for i, example in enumerate(dataset):
guid = "%s-%s" % (set_type, i)
...@@ -742,6 +760,8 @@ class WnliProcessor(DefaultGLUEDataProcessor):
"""Creates examples for the training/dev/test sets."""
dataset = tfds.load(
"glue/wnli", split=set_type, try_gcs=True).as_numpy_iterator()
dataset = list(dataset)
dataset.sort(key=lambda x: x["idx"])
examples = []
for i, example in enumerate(dataset):
guid = "%s-%s" % (set_type, i)
...
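Each processor change above materializes the TFDS iterator and sorts it by the example `idx` before enumerating. The motivation: iterator order is not guaranteed to be stable across runs, while the generated GUIDs (`"%s-%s" % (set_type, i)`) depend on enumeration order. A standalone sketch with stubbed records (no TFDS download needed; the record contents are illustrative only):

```python
# Stub records mimicking tfds.load(...).as_numpy_iterator() output, whose
# order is not guaranteed to be stable across runs.
records = [
    {"idx": 2, "sentence": b"c"},
    {"idx": 0, "sentence": b"a"},
    {"idx": 1, "sentence": b"b"},
]

# Same pattern as every processor above: materialize the iterator and sort
# by "idx" so the enumerate-based GUIDs are deterministic.
dataset = list(records)
dataset.sort(key=lambda x: x["idx"])
guids = ["train-%s" % i for i, _ in enumerate(dataset)]
print(guids)  # ['train-0', 'train-1', 'train-2'], always in idx order
```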
...@@ -178,13 +178,13 @@ class KernelAttention(tf.keras.layers.MultiHeadAttention):
is_short_seq=False,
begin_kernel=0,
scale=None,
scale_by_length=False,
**kwargs):
r"""Constructor of KernelAttention.
Args:
feature_transform: A non-linear transform of the keys and queries. Possible
transforms are "elu", "relu", "square", "exp", "expmod", "identity".
num_random_features: Number of random features to be used for projection.
if num_random_features <= 0, no projection is used before transform.
seed: The seed to begin drawing random features. Once the seed is set, the
...@@ -194,12 +194,16 @@ class KernelAttention(tf.keras.layers.MultiHeadAttention):
redraw: Whether to redraw projection every forward pass during training.
The argument is only effective when num_random_features > 0.
is_short_seq: boolean predicate indicating whether input data consists of
very short sequences or not; in most cases this should be False (default
option).
begin_kernel: Apply kernel_attention after this sequence id and apply
softmax attention before this.
scale: The value to scale the dot product as described in `Attention Is
All You Need`. If None, we use 1/sqrt(dk) as described in the paper.
scale_by_length: boolean predicate indicating whether to additionally scale
the dot product based on key length. Set as log_512(n) to stabilize
attention entropy against length. Refer to
https://kexue.fm/archives/8823 for details.
**kwargs: The same arguments as the `MultiHeadAttention` layer.
"""
if feature_transform not in _TRANSFORM_MAP:
...@@ -214,6 +218,7 @@ class KernelAttention(tf.keras.layers.MultiHeadAttention):
self._redraw = redraw
self._is_short_seq = is_short_seq
self._begin_kernel = begin_kernel
self._scale_by_length = scale_by_length
# We use the seed for two scenarios:
# 1. inference
# 2. no redraw
...@@ -252,9 +257,9 @@ class KernelAttention(tf.keras.layers.MultiHeadAttention):
is_short_seq: boolean predicate indicating whether input data consists of
short or long sequences; usually short sequence is defined as having
length L <= 1024.
attention_mask: a boolean mask of shape `[B, S]`, that prevents attending
to masked positions. Note that the mask is only applied to the keys.
User may want to mask the output if query contains pads.
training: Python boolean indicating whether the layer should behave in
training mode (adding dropout) or in inference mode (doing nothing).
numeric_stabler: A scalar value added to avoid divide by 0.
...@@ -270,17 +275,23 @@ class KernelAttention(tf.keras.layers.MultiHeadAttention):
else:
projection_matrix = self._projection_matrix
if self._scale_by_length:
scale = tf.math.log(tf.reduce_sum(attention_mask,
axis=-1)) * self._scale / math.log(512)
scale = tf.reshape(scale, [-1, 1, 1, 1])
else:
scale = self._scale
if is_short_seq: if is_short_seq:
# Note: Applying scalar multiply at the smaller end of einsum improves # Note: Applying scalar multiply at the smaller end of einsum improves
# XLA performance, but may introduce slight numeric differences in # XLA performance, but may introduce slight numeric differences in
# the Transformer attention head. # the Transformer attention head.
query = query * self._scale query = query * scale
else: else:
# Note: we suspect spliting the scale to key, query yields smaller # Note: we suspect spliting the scale to key, query yields smaller
# approximation variance when random projection is used. # approximation variance when random projection is used.
# For simplicity, we also split when there's no random projection. # For simplicity, we also split when there's no random projection.
key *= math.sqrt(self._scale) key *= tf.math.sqrt(scale)
query *= math.sqrt(self._scale) query *= tf.math.sqrt(scale)
key = _TRANSFORM_MAP[feature_transform](key, projection_matrix) key = _TRANSFORM_MAP[feature_transform](key, projection_matrix)
query = _TRANSFORM_MAP[feature_transform](query, projection_matrix) query = _TRANSFORM_MAP[feature_transform](query, projection_matrix)
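The length-dependent scaling in the new `scale_by_length` branch can be sketched in plain Python. This is a minimal illustration, not the library code: it assumes a fully unmasked sequence (so the mask sum equals the sequence length) and uses 512 as the reference length, matching the `math.log(512)` denominator above.

```python
import math


def length_scale(seq_len, base_scale, ref_len=512):
    # Grows logarithmically with sequence length and equals base_scale
    # exactly when seq_len == ref_len, since log(ref_len)/log(ref_len) == 1.
    return math.log(seq_len) / math.log(ref_len) * base_scale


# A typical 1/sqrt(key_dim) attention scale for key_dim == 64.
base = 1.0 / math.sqrt(64)
scale_512 = length_scale(512, base)  # unchanged at the reference length
scale_128 = length_scale(128, base)  # shorter sequences get a smaller scale
```

This is why the test below expects identical outputs at `seq_length == 512` and different outputs otherwise.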
...@@ -330,9 +341,9 @@ class KernelAttention(tf.keras.layers.MultiHeadAttention):
      value: Value `Tensor` of shape `[B, S, dim]`.
      key: Optional key `Tensor` of shape `[B, S, dim]`. If not given, will use
        `value` for both `key` and `value`, which is the most common case.
      attention_mask: a boolean mask of shape `[B, S]`, that prevents attending
        to masked positions. Note that the mask is only applied to the keys. The
        user may want to mask the output if the query contains pads.
      training: Python boolean indicating whether the layer should behave in
        training mode (adding dropout) or in inference mode (doing nothing).
...@@ -373,9 +384,10 @@ class KernelAttention(tf.keras.layers.MultiHeadAttention):
      attention_output = tf.concat(
          [attention_output_softmax, attention_output_kernel], axis=1)
    else:
      attention_output = self._compute_attention(query, key, value,
                                                 self._feature_transform,
                                                 self._is_short_seq,
                                                 attention_mask, training)
    # This is actually dropping out entire tokens to attend to, which might
    # seem a bit unusual, but is taken from the original Transformer paper.
    attention_output = self._dropout_layer(attention_output)
...
...@@ -30,8 +30,8 @@ _BEGIN_KERNEL = [0, 512]
class KernelAttentionTest(tf.test.TestCase, parameterized.TestCase):
  @parameterized.parameters(
      itertools.product(_FEATURE_TRANSFORM, [127], _TRAINING, [True, False],
                        _IS_SHORT_SEQ, _BEGIN_KERNEL))
  def test_attention_projection(
      self, feature_transform, num_random_features, training, redraw, is_short,
...@@ -90,6 +90,32 @@ class KernelAttentionTest(tf.test.TestCase, parameterized.TestCase):
        training=training)
    self.assertEqual(output.shape, [batch_size, seq_length, key_dim])
  @parameterized.parameters([128, 512])
  def test_attention_scale_by_length(self, seq_length):
    num_heads = 12
    key_dim = 64
    batch_size = 2
    test_layer = attention.KernelAttention(
        num_heads=num_heads,
        key_dim=key_dim,
        num_random_features=0,
        scale_by_length=True)
    query = tf.random.normal(shape=(batch_size, seq_length, key_dim))
    value = query
    encoder_inputs_mask = tf.ones((batch_size, seq_length), dtype=tf.int32)
    masks = tf.cast(encoder_inputs_mask, dtype=tf.float32)
    output_scale_by_length = test_layer(
        query=query, value=value, attention_mask=masks)
    test_layer._scale_by_length = False
    output_no_scale_by_length = test_layer(
        query=query, value=value, attention_mask=masks)
    if seq_length == 512:  # Equal because log(seq_length) / log(512) = 1.0
      self.assertAllClose(output_scale_by_length, output_no_scale_by_length)
    else:
      self.assertNotAllClose(output_scale_by_length, output_no_scale_by_length)
  def test_unsupported_feature_transform(self):
    with self.assertRaisesRegex(ValueError, 'Unsupported feature_transform.*'):
      _ = attention.KernelAttention(feature_transform='test')
...
...@@ -14,6 +14,7 @@
"""Keras-based TransformerEncoder block layer.""" """Keras-based TransformerEncoder block layer."""
from absl import logging
import tensorflow as tf

from official.nlp.modeling.layers import util
...@@ -176,9 +177,9 @@ class TransformerEncoderBlock(tf.keras.layers.Layer):
einsum_equation = "...bc,cd->...bd" einsum_equation = "...bc,cd->...bd"
hidden_size = input_tensor_shape[-1] hidden_size = input_tensor_shape[-1]
if hidden_size % self._num_heads != 0: if hidden_size % self._num_heads != 0:
raise ValueError( logging.warning(
"The input size (%d) is not a multiple of the number of attention " "The input size (%d) is not a multiple of the number of attention "
"heads (%d)" % (hidden_size, self._num_heads)) "heads (%d)", hidden_size, self._num_heads)
if self._key_dim is None: if self._key_dim is None:
self._key_dim = int(hidden_size // self._num_heads) self._key_dim = int(hidden_size // self._num_heads)
if self._output_last_dim is None: if self._output_last_dim is None:
...