diff --git a/.github/README_TEMPLATE.md b/.github/README_TEMPLATE.md index d04a0299536242a6d4c0743242c6616eaac40a97..43dba40f59684df0f79faa341c8de67916313210 100644 --- a/.github/README_TEMPLATE.md +++ b/.github/README_TEMPLATE.md @@ -1,13 +1,13 @@ > :memo: A README.md template for releasing a paper code implementation to a GitHub repository. > -> * Template version: 1.0.2020.125 +> * Template version: 1.0.2020.170 > * Please modify sections depending on needs. # Model name, Paper title, or Project Name > :memo: Add a badge for the ArXiv identifier of your paper (arXiv:YYMM.NNNNN) -[![Paper](http://img.shields.io/badge/paper-arXiv.YYMM.NNNNN-B3181B.svg)](https://arxiv.org/abs/...) +[![Paper](http://img.shields.io/badge/Paper-arXiv.YYMM.NNNNN-B3181B?logo=arXiv)](https://arxiv.org/abs/...) This repository is the official or unofficial implementation of the following paper. @@ -28,8 +28,8 @@ This repository is the official or unofficial implementation of the following pa > :memo: Provide maintainer information. 
-* Last name, First name ([@GitHub username](https://github.com/username)) -* Last name, First name ([@GitHub username](https://github.com/username)) +* Full name ([@GitHub username](https://github.com/username)) +* Full name ([@GitHub username](https://github.com/username)) ## Table of Contents @@ -37,8 +37,8 @@ This repository is the official or unofficial implementation of the following pa ## Requirements -[![TensorFlow 2.1](https://img.shields.io/badge/tensorflow-2.1-brightgreen)](https://github.com/tensorflow/tensorflow/releases/tag/v2.1.0) -[![Python 3.6](https://img.shields.io/badge/python-3.6-blue.svg)](https://www.python.org/downloads/release/python-360/) +[![TensorFlow 2.1](https://img.shields.io/badge/TensorFlow-2.1-FF6F00?logo=tensorflow)](https://github.com/tensorflow/tensorflow/releases/tag/v2.1.0) +[![Python 3.6](https://img.shields.io/badge/Python-3.6-3776AB)](https://www.python.org/downloads/release/python-360/) > :memo: Provide details of the software required. > @@ -54,6 +54,8 @@ pip install -r requirements.txt ## Results +[![TensorFlow Hub](https://img.shields.io/badge/TF%20Hub-Models-FF6F00?logo=tensorflow)](https://tfhub.dev/...) + > :memo: Provide a table with results. (e.g., accuracy, latency) > > * Provide links to the pre-trained models (checkpoint, SavedModel files). @@ -104,6 +106,8 @@ python3 ... ## License +[![License](https://img.shields.io/badge/License-Apache%202.0-blue.svg)](https://opensource.org/licenses/Apache-2.0) + > :memo: Place your license text in a file named LICENSE in the root of the repository. > > * Include information about your license. 
diff --git a/README.md b/README.md index d9dbe91ddcf56cda804b3b46e7a03fd0faa54d6f..203051feb7acbf3f6501d5c29516841958bedb75 100644 --- a/README.md +++ b/README.md @@ -2,28 +2,34 @@ # Welcome to the Model Garden for TensorFlow -The TensorFlow Model Garden is a repository with a number of different implementations of state-of-the-art (SOTA) models and modeling solutions for TensorFlow users. We aim to demonstrate the best practices for modeling so that TensorFlow users can take full advantage of TensorFlow for their research and product development. +The TensorFlow Model Garden is a repository with a number of different implementations of state-of-the-art (SOTA) models and modeling solutions for TensorFlow users. We aim to demonstrate the best practices for modeling so that TensorFlow users +can take full advantage of TensorFlow for their research and product development. | Directory | Description | |-----------|-------------| | [official](official) | • A collection of example implementations for SOTA models using the latest TensorFlow 2's high-level APIs
• Officially maintained, supported, and kept up to date with the latest TensorFlow 2 APIs by TensorFlow
• Reasonably optimized for fast performance while still being easy to read | | [research](research) | • A collection of research model implementations in TensorFlow 1 or 2 by researchers
• Maintained and supported by researchers | | [community](community) | • A curated list of the GitHub repositories with machine learning models and implementations powered by TensorFlow 2 | +| [orbit](orbit) | • A flexible and lightweight library that users can easily use or fork when writing customized training loop code in TensorFlow 2.x. It seamlessly integrates with `tf.distribute` and supports running on different device types (CPU, GPU, and TPU). | -## [Announcements](../../wiki/Announcements) +## [Announcements](https://github.com/tensorflow/models/wiki/Announcements) | Date | News | |------|------| -| May 21, 2020 | [Unifying Deep Local and Global Features for Image Search (DELG)](https://github.com/tensorflow/models/tree/master/research/delf#delg) code released -| May 7, 2020 | [MnasFPN with MobileNet-V2 backbone](https://github.com/tensorflow/models/blob/master/research/object_detection/g3doc/detection_model_zoo.md#mobile-models) released for object detection -| May 1, 2020 | [DELF: DEep Local Features](https://github.com/tensorflow/models/tree/master/research/delf) updated to support TensorFlow 2.1 +| July 10, 2020 | TensorFlow 2 meets the [Object Detection API](https://github.com/tensorflow/models/tree/master/research/object_detection) ([Blog](https://blog.tensorflow.org/2020/07/tensorflow-2-meets-object-detection-api.html)) | +| June 30, 2020 | [SpineNet: Learning Scale-Permuted Backbone for Recognition and Localization](https://github.com/tensorflow/models/tree/master/official/vision/detection#train-a-spinenet-49-based-mask-r-cnn) released ([Tweet](https://twitter.com/GoogleAI/status/1278016712978264064)) | +| June 17, 2020 | [Context R-CNN: Long Term Temporal Context for Per-Camera Object Detection](https://github.com/tensorflow/models/tree/master/research/object_detection#june-17th-2020) released ([Tweet](https://twitter.com/GoogleAI/status/1276571419422253057)) | +| May 21, 2020 | [Unifying Deep Local and Global Features for Image Search 
(DELG)](https://github.com/tensorflow/models/tree/master/research/delf#delg) code released | +| May 19, 2020 | [MobileDets: Searching for Object Detection Architectures for Mobile Accelerators](https://github.com/tensorflow/models/tree/master/research/object_detection#may-19th-2020) released | +| May 7, 2020 | [MnasFPN with MobileNet-V2 backbone](https://github.com/tensorflow/models/blob/master/research/object_detection/g3doc/detection_model_zoo.md#mobile-models) released for object detection | +| May 1, 2020 | [DELF: DEep Local Features](https://github.com/tensorflow/models/tree/master/research/delf) updated to support TensorFlow 2.1 | | March 31, 2020 | [Introducing the Model Garden for TensorFlow 2](https://blog.tensorflow.org/2020/03/introducing-model-garden-for-tensorflow-2.html) ([Tweet](https://twitter.com/TensorFlow/status/1245029834633297921)) | ## Contributions [![help wanted:paper implementation](https://img.shields.io/github/issues/tensorflow/models/help%20wanted%3Apaper%20implementation)](https://github.com/tensorflow/models/labels/help%20wanted%3Apaper%20implementation) -If you want to contribute, please review the [contribution guidelines](../../wiki/How-to-contribute). +If you want to contribute, please review the [contribution guidelines](https://github.com/tensorflow/models/wiki/How-to-contribute). ## License diff --git a/community/README.md b/community/README.md index eea11fc2b63fede7b983e7e3aa9390400be22c6b..ed01dfbed07bca73b321336d59fd5d174545f6cd 100644 --- a/community/README.md +++ b/community/README.md @@ -6,13 +6,12 @@ This repository provides a curated list of the GitHub repositories with machine **Note**: Contributing companies or individuals are responsible for maintaining their repositories. 
-## Models / Implementations +## Computer Vision -### Computer Vision +### Image Recognition -#### Image Recognition -| Model | Reference (Paper) | Features | Maintainer | -|-------|-------------------|----------|------------| +| Model | Paper | Features | Maintainer | +|-------|-------|----------|------------| | [DenseNet 169](https://github.com/IntelAI/models/tree/master/benchmarks/image_recognition/tensorflow/densenet169) | [Densely Connected Convolutional Networks](https://arxiv.org/pdf/1608.06993) | • FP32 Inference | [Intel](https://github.com/IntelAI) | | [Inception V3](https://github.com/IntelAI/models/tree/master/benchmarks/image_recognition/tensorflow/inceptionv3) | [Rethinking the Inception Architecture
for Computer Vision](https://arxiv.org/pdf/1512.00567.pdf) | • Int8 Inference
• FP32 Inference | [Intel](https://github.com/IntelAI) | | [Inception V4](https://github.com/IntelAI/models/tree/master/benchmarks/image_recognition/tensorflow/inceptionv4) | [Inception-v4, Inception-ResNet and the Impact
of Residual Connections on Learning](https://arxiv.org/pdf/1602.07261) | • Int8 Inference
• FP32 Inference | [Intel](https://github.com/IntelAI) | @@ -21,12 +20,21 @@ This repository provides a curated list of the GitHub repositories with machine | [ResNet 50](https://github.com/IntelAI/models/tree/master/benchmarks/image_recognition/tensorflow/resnet50) | [Deep Residual Learning for Image Recognition](https://arxiv.org/pdf/1512.03385) | • Int8 Inference
• FP32 Inference | [Intel](https://github.com/IntelAI) | | [ResNet 50v1.5](https://github.com/IntelAI/models/tree/master/benchmarks/image_recognition/tensorflow/resnet50v1_5) | [Deep Residual Learning for Image Recognition](https://arxiv.org/pdf/1512.03385) | • Int8 Inference
• FP32 Inference
• FP32 Training | [Intel](https://github.com/IntelAI) | -#### Segmentation -| Model | Reference (Paper) |       Features       | Maintainer | -|-------|-------------------|----------|------------| +### Object Detection + +| Model | Paper | Features | Maintainer | +|-------|-------|----------|------------| +| [R-FCN](https://github.com/IntelAI/models/tree/master/benchmarks/object_detection/tensorflow/rfcn) | [R-FCN: Object Detection
via Region-based Fully Convolutional Networks](https://arxiv.org/pdf/1605.06409) | • Int8 Inference
• FP32 Inference | [Intel](https://github.com/IntelAI) | +| [SSD-MobileNet](https://github.com/IntelAI/models/tree/master/benchmarks/object_detection/tensorflow/ssd-mobilenet) | [MobileNets: Efficient Convolutional Neural Networks
for Mobile Vision Applications](https://arxiv.org/pdf/1704.04861) | • Int8 Inference
• FP32 Inference | [Intel](https://github.com/IntelAI) | +| [SSD-ResNet34](https://github.com/IntelAI/models/tree/master/benchmarks/object_detection/tensorflow/ssd-resnet34) | [SSD: Single Shot MultiBox Detector](https://arxiv.org/pdf/1512.02325) | • Int8 Inference
• FP32 Inference
• FP32 Training | [Intel](https://github.com/IntelAI) | + +### Segmentation + +| Model | Paper | Features | Maintainer | +|-------|-------|----------|------------| | [Mask R-CNN](https://github.com/NVIDIA/DeepLearningExamples/tree/master/TensorFlow2/Segmentation/MaskRCNN) | [Mask R-CNN](https://arxiv.org/abs/1703.06870) | • Automatic Mixed Precision
• Multi-GPU training support with Horovod
• TensorRT | [NVIDIA](https://github.com/NVIDIA) | | [U-Net Medical Image Segmentation](https://github.com/NVIDIA/DeepLearningExamples/tree/master/TensorFlow2/Segmentation/UNet_Medical) | [U-Net: Convolutional Networks for Biomedical Image Segmentation](https://arxiv.org/abs/1505.04597) | • Automatic Mixed Precision
• Multi-GPU training support with Horovod
• TensorRT | [NVIDIA](https://github.com/NVIDIA) | ## Contributions -If you want to contribute, please review the [contribution guidelines](../../../wiki/How-to-contribute). +If you want to contribute, please review the [contribution guidelines](https://github.com/tensorflow/models/wiki/How-to-contribute). diff --git a/official/README.md b/official/README.md index 84fd2e6342f9d7ce9a74fc2c7a3518fa5b7efd17..77e43ea9c15e9a18cfee3fb757016cf5091d0c28 100644 --- a/official/README.md +++ b/official/README.md @@ -17,11 +17,9 @@ with the same or improved speed and performance with each new TensorFlow build. The team is actively developing new models. In the near future, we will add: -* State-of-the-art language understanding models: - More members in Transformer family -* Start-of-the-art image classification models: - EfficientNet, MnasNet, and variants -* A set of excellent objection detection models. +* State-of-the-art language understanding models. +* State-of-the-art image classification models. +* State-of-the-art object detection and instance segmentation models. 
## Table of Contents @@ -43,6 +41,7 @@ In the near future, we will add: |-------|-------------------| | [MNIST](vision/image_classification) | A basic model to classify digits from the [MNIST dataset](http://yann.lecun.com/exdb/mnist/) | | [ResNet](vision/image_classification) | [Deep Residual Learning for Image Recognition](https://arxiv.org/abs/1512.03385) | +| [EfficientNet](vision/image_classification) | [EfficientNet: Rethinking Model Scaling for Convolutional Neural Networks](https://arxiv.org/abs/1905.11946) | #### Object Detection and Segmentation @@ -50,6 +49,8 @@ In the near future, we will add: |-------|-------------------| | [RetinaNet](vision/detection) | [Focal Loss for Dense Object Detection](https://arxiv.org/abs/1708.02002) | | [Mask R-CNN](vision/detection) | [Mask R-CNN](https://arxiv.org/abs/1703.06870) | +| [ShapeMask](vision/detection) | [ShapeMask: Learning to Segment Novel Objects by Refining Shape Priors](https://arxiv.org/abs/1904.03239) | +| [SpineNet](vision/detection) | [SpineNet: Learning Scale-Permuted Backbone for Recognition and Localization](https://arxiv.org/abs/1912.05027) | ### Natural Language Processing diff --git a/official/benchmark/bert_pretrain_benchmark.py b/official/benchmark/bert_pretrain_benchmark.py index d63c894847d8e9e9308523d3efcb06c162d323c6..be14b34b588980036267c9cf29f94475f538304e 100644 --- a/official/benchmark/bert_pretrain_benchmark.py +++ b/official/benchmark/bert_pretrain_benchmark.py @@ -144,6 +144,39 @@ class BertPretrainAccuracyBenchmark(bert_benchmark_utils.BertBenchmarkBase): self._run_and_report_benchmark(summary_path=summary_path, report_accuracy=True) + @owner_utils.Owner('tf-model-garden') + def benchmark_perf_2x2_tpu_bf16_seq128_10k_steps(self): + """Test bert pretraining with 2x2 TPU for 10000 steps.""" + self._setup() + self._specify_common_flags() + FLAGS.num_steps_per_epoch = 5000 + FLAGS.num_train_epochs = 2 + FLAGS.train_batch_size = 128 + FLAGS.model_dir = self._get_model_dir( + 
'benchmark_perf_2x2_tpu_bf16_seq128_10k_steps') + summary_path = os.path.join(FLAGS.model_dir, + 'summaries/training_summary.txt') + # Disable accuracy check. + self._run_and_report_benchmark( + summary_path=summary_path, report_accuracy=False) + + @owner_utils.Owner('tf-model-garden') + def benchmark_perf_2x2_tpu_bf16_seq128_10k_steps_mlir(self): + """Test bert pretraining with 2x2 TPU with MLIR for 10000 steps.""" + self._setup() + self._specify_common_flags() + FLAGS.num_steps_per_epoch = 5000 + FLAGS.num_train_epochs = 2 + FLAGS.train_batch_size = 128 + FLAGS.model_dir = self._get_model_dir( + 'benchmark_perf_2x2_tpu_bf16_seq128_10k_steps_mlir') + summary_path = os.path.join(FLAGS.model_dir, + 'summaries/training_summary.txt') + tf.config.experimental.enable_mlir_bridge() + # Disable accuracy check. + self._run_and_report_benchmark( + summary_path=summary_path, report_accuracy=False) + @owner_utils.Owner('tf-model-garden') def benchmark_perf_4x4_tpu_bf16_seq128_10k_steps(self): """Test bert pretraining with 4x4 TPU for 10000 steps.""" @@ -159,6 +192,22 @@ class BertPretrainAccuracyBenchmark(bert_benchmark_utils.BertBenchmarkBase): self._run_and_report_benchmark( summary_path=summary_path, report_accuracy=False) + @owner_utils.Owner('tf-model-garden') + def benchmark_perf_4x4_tpu_bf16_seq128_10k_steps_mlir(self): + """Test bert pretraining with 4x4 TPU with MLIR for 10000 steps.""" + self._setup() + self._specify_common_flags() + FLAGS.num_steps_per_epoch = 5000 + FLAGS.num_train_epochs = 2 + FLAGS.model_dir = self._get_model_dir( + 'benchmark_perf_4x4_tpu_bf16_seq128_10k_steps_mlir') + summary_path = os.path.join(FLAGS.model_dir, + 'summaries/training_summary.txt') + tf.config.experimental.enable_mlir_bridge() + # Disable accuracy check. 
+ self._run_and_report_benchmark( + summary_path=summary_path, report_accuracy=False) + @owner_utils.Owner('tf-model-garden') def benchmark_perf_8x8_tpu_bf16_seq128_10k_steps(self): """Test bert pretraining with 8x8 TPU for 10000 steps.""" diff --git a/official/benchmark/keras_imagenet_benchmark.py b/official/benchmark/keras_imagenet_benchmark.py index 63a48dfb1222b65311652e3bee4241854a55043e..9dfcede08b64f6670b010d389c554a2be9dac035 100644 --- a/official/benchmark/keras_imagenet_benchmark.py +++ b/official/benchmark/keras_imagenet_benchmark.py @@ -299,20 +299,21 @@ class MobilenetV1KerasAccuracy(keras_benchmark.KerasBenchmark): return os.path.join(self.output_dir, folder_name) -class Resnet50KerasClassifierBenchmarkBase(keras_benchmark.KerasBenchmark): - """Resnet50 (classifier_trainer) benchmarks.""" +class KerasClassifierBenchmarkBase(keras_benchmark.KerasBenchmark): + """Classifier Trainer benchmarks.""" - def __init__(self, output_dir=None, default_flags=None, + def __init__(self, model, output_dir=None, default_flags=None, tpu=None, dataset_builder='records', train_epochs=1, train_steps=110, data_dir=None): flag_methods = [classifier_trainer.define_classifier_flags] + self.model = model self.dataset_builder = dataset_builder self.train_epochs = train_epochs self.train_steps = train_steps self.data_dir = data_dir - super(Resnet50KerasClassifierBenchmarkBase, self).__init__( + super(KerasClassifierBenchmarkBase, self).__init__( output_dir=output_dir, flag_methods=flag_methods, default_flags=default_flags, @@ -337,7 +338,7 @@ class Resnet50KerasClassifierBenchmarkBase(keras_benchmark.KerasBenchmark): dataset_num_private_threads: Optional[int] = None, loss_scale: Optional[str] = None): """Runs and reports the benchmark given the provided configuration.""" - FLAGS.model_type = 'resnet' + FLAGS.model_type = self.model FLAGS.dataset = 'imagenet' FLAGS.mode = 'train_and_eval' FLAGS.data_dir = self.data_dir @@ -372,7 +373,7 @@ class 
Resnet50KerasClassifierBenchmarkBase(keras_benchmark.KerasBenchmark): # input skip_steps. warmup = (skip_steps or (self.train_steps - 100)) // FLAGS.log_steps - super(Resnet50KerasClassifierBenchmarkBase, self)._report_benchmark( + super(KerasClassifierBenchmarkBase, self)._report_benchmark( stats, wall_time_sec, total_batch_size=total_batch_size, @@ -599,8 +600,7 @@ class Resnet50KerasClassifierBenchmarkBase(keras_benchmark.KerasBenchmark): distribution_strategy='mirrored', per_replica_batch_size=256, gpu_thread_mode='gpu_private', - dataset_num_private_threads=48, - steps=310) + dataset_num_private_threads=48) def benchmark_xla_8_gpu_fp16_dynamic_tweaked(self): """Tests Keras model with config tuning, XLA, 8 GPUs and dynamic fp16.""" @@ -636,6 +636,28 @@ class Resnet50KerasClassifierBenchmarkBase(keras_benchmark.KerasBenchmark): distribution_strategy='tpu', per_replica_batch_size=128) + def benchmark_2x2_tpu_bf16_mlir(self): + """Test Keras model with 2x2 TPU, bf16, and MLIR enabled.""" + self._setup() + tf.config.experimental.enable_mlir_bridge() + self._run_and_report_benchmark( + experiment_name='benchmark_2x2_tpu_bf16_mlir', + dtype='bfloat16', + num_tpus=8, + distribution_strategy='tpu', + per_replica_batch_size=128) + + def benchmark_4x4_tpu_bf16_mlir(self): + """Test Keras model with 4x4 TPU, bf16, and MLIR enabled.""" + self._setup() + tf.config.experimental.enable_mlir_bridge() + self._run_and_report_benchmark( + experiment_name='benchmark_4x4_tpu_bf16_mlir', + dtype='bfloat16', + num_tpus=32, + distribution_strategy='tpu', + per_replica_batch_size=128) + def benchmark_8x8_tpu_bf16(self): """Test Keras model with 8x8 TPU, bf16.""" self._setup() @@ -647,7 +669,7 @@ class Resnet50KerasClassifierBenchmarkBase(keras_benchmark.KerasBenchmark): per_replica_batch_size=64) def fill_report_object(self, stats): - super(Resnet50KerasClassifierBenchmarkBase, self).fill_report_object( + super(KerasClassifierBenchmarkBase, self).fill_report_object( stats, total_batch_size=FLAGS.batch_size, 
log_steps=FLAGS.log_steps) @@ -1086,7 +1108,7 @@ class Resnet50KerasBenchmarkBase(keras_benchmark.KerasBenchmark): log_steps=FLAGS.log_steps) -class Resnet50KerasBenchmarkSynth(Resnet50KerasClassifierBenchmarkBase): +class Resnet50KerasBenchmarkSynth(KerasClassifierBenchmarkBase): """Resnet50 synthetic benchmark tests.""" def __init__(self, output_dir=None, root_data_dir=None, tpu=None, **kwargs): @@ -1094,11 +1116,11 @@ class Resnet50KerasBenchmarkSynth(Resnet50KerasClassifierBenchmarkBase): def_flags['log_steps'] = 10 super(Resnet50KerasBenchmarkSynth, self).__init__( - output_dir=output_dir, default_flags=def_flags, tpu=tpu, + model='resnet', output_dir=output_dir, default_flags=def_flags, tpu=tpu, dataset_builder='synthetic', train_epochs=1, train_steps=110) -class Resnet50KerasBenchmarkReal(Resnet50KerasClassifierBenchmarkBase): +class Resnet50KerasBenchmarkReal(KerasClassifierBenchmarkBase): """Resnet50 real data benchmark tests.""" def __init__(self, output_dir=None, root_data_dir=None, tpu=None, **kwargs): @@ -1107,11 +1129,25 @@ class Resnet50KerasBenchmarkReal(Resnet50KerasClassifierBenchmarkBase): def_flags['log_steps'] = 10 super(Resnet50KerasBenchmarkReal, self).__init__( - output_dir=output_dir, default_flags=def_flags, tpu=tpu, + model='resnet', output_dir=output_dir, default_flags=def_flags, tpu=tpu, dataset_builder='records', train_epochs=1, train_steps=110, data_dir=data_dir) +class EfficientNetKerasBenchmarkReal(KerasClassifierBenchmarkBase): + """EfficientNet real data benchmark tests.""" + + def __init__(self, output_dir=None, root_data_dir=None, tpu=None, **kwargs): + data_dir = os.path.join(root_data_dir, 'imagenet') + def_flags = {} + def_flags['log_steps'] = 10 + + super(EfficientNetKerasBenchmarkReal, self).__init__( + model='efficientnet', output_dir=output_dir, default_flags=def_flags, + tpu=tpu, dataset_builder='records', train_epochs=1, train_steps=110, + data_dir=data_dir) + + class 
Resnet50KerasBenchmarkRemoteData(Resnet50KerasBenchmarkBase): """Resnet50 real data (stored in remote storage) benchmark tests.""" diff --git a/official/benchmark/resnet_ctl_imagenet_benchmark.py b/official/benchmark/resnet_ctl_imagenet_benchmark.py index 0e70e8da969ec9b02a2de00d1973bdd2aa5f2b51..f4a7f4bd5e797965d880900324d2969dbc0130ba 100644 --- a/official/benchmark/resnet_ctl_imagenet_benchmark.py +++ b/official/benchmark/resnet_ctl_imagenet_benchmark.py @@ -38,13 +38,18 @@ FLAGS = flags.FLAGS class CtlBenchmark(PerfZeroBenchmark): """Base benchmark class with methods to simplify testing.""" - def __init__(self, output_dir=None, default_flags=None, flag_methods=None): + def __init__(self, + output_dir=None, + default_flags=None, + flag_methods=None, + **kwargs): self.default_flags = default_flags or {} self.flag_methods = flag_methods or {} super(CtlBenchmark, self).__init__( output_dir=output_dir, default_flags=self.default_flags, - flag_methods=self.flag_methods) + flag_methods=self.flag_methods, + **kwargs) def _report_benchmark(self, stats, @@ -190,13 +195,14 @@ class Resnet50CtlAccuracy(CtlBenchmark): class Resnet50CtlBenchmarkBase(CtlBenchmark): """Resnet50 benchmarks.""" - def __init__(self, output_dir=None, default_flags=None): + def __init__(self, output_dir=None, default_flags=None, **kwargs): flag_methods = [common.define_keras_flags] super(Resnet50CtlBenchmarkBase, self).__init__( output_dir=output_dir, flag_methods=flag_methods, - default_flags=default_flags) + default_flags=default_flags, + **kwargs) @benchmark_wrappers.enable_runtime_flags def _run_and_report_benchmark(self): @@ -381,12 +387,24 @@ class Resnet50CtlBenchmarkBase(CtlBenchmark): FLAGS.single_l2_loss_op = True FLAGS.use_tf_function = True FLAGS.enable_checkpoint_and_export = False + FLAGS.data_dir = 'gs://mlcompass-data/imagenet/imagenet-2012-tfrecord' def benchmark_2x2_tpu_bf16(self): self._setup() self._set_df_common() FLAGS.batch_size = 1024 FLAGS.dtype = 'bf16' + FLAGS.model_dir = 
self._get_model_dir('benchmark_2x2_tpu_bf16') + self._run_and_report_benchmark() + + @owner_utils.Owner('tf-graph-compiler') + def benchmark_2x2_tpu_bf16_mlir(self): + self._setup() + self._set_df_common() + FLAGS.batch_size = 1024 + FLAGS.dtype = 'bf16' + tf.config.experimental.enable_mlir_bridge() + FLAGS.model_dir = self._get_model_dir('benchmark_2x2_tpu_bf16_mlir') self._run_and_report_benchmark() def benchmark_4x4_tpu_bf16(self): @@ -394,6 +412,7 @@ class Resnet50CtlBenchmarkBase(CtlBenchmark): self._set_df_common() FLAGS.batch_size = 4096 FLAGS.dtype = 'bf16' + FLAGS.model_dir = self._get_model_dir('benchmark_4x4_tpu_bf16') self._run_and_report_benchmark() @owner_utils.Owner('tf-graph-compiler') @@ -403,6 +422,7 @@ class Resnet50CtlBenchmarkBase(CtlBenchmark): self._set_df_common() FLAGS.batch_size = 4096 FLAGS.dtype = 'bf16' + FLAGS.model_dir = self._get_model_dir('benchmark_4x4_tpu_bf16_mlir') tf.config.experimental.enable_mlir_bridge() self._run_and_report_benchmark() @@ -426,11 +446,11 @@ class Resnet50CtlBenchmarkSynth(Resnet50CtlBenchmarkBase): def_flags['skip_eval'] = True def_flags['use_synthetic_data'] = True def_flags['train_steps'] = 110 - def_flags['steps_per_loop'] = 20 + def_flags['steps_per_loop'] = 10 def_flags['log_steps'] = 10 super(Resnet50CtlBenchmarkSynth, self).__init__( - output_dir=output_dir, default_flags=def_flags) + output_dir=output_dir, default_flags=def_flags, **kwargs) class Resnet50CtlBenchmarkReal(Resnet50CtlBenchmarkBase): @@ -441,11 +461,11 @@ class Resnet50CtlBenchmarkReal(Resnet50CtlBenchmarkBase): def_flags['skip_eval'] = True def_flags['data_dir'] = os.path.join(root_data_dir, 'imagenet') def_flags['train_steps'] = 110 - def_flags['steps_per_loop'] = 20 + def_flags['steps_per_loop'] = 10 def_flags['log_steps'] = 10 super(Resnet50CtlBenchmarkReal, self).__init__( - output_dir=output_dir, default_flags=def_flags) + output_dir=output_dir, default_flags=def_flags, **kwargs) if __name__ == '__main__': diff --git 
a/official/benchmark/retinanet_benchmark.py b/official/benchmark/retinanet_benchmark.py index 62bc80eef1fd00d5087af5522561ff7cf7863f5e..3b87fd21294ac1aa9334579b31b861f77e32399c 100644 --- a/official/benchmark/retinanet_benchmark.py +++ b/official/benchmark/retinanet_benchmark.py @@ -44,11 +44,11 @@ RESNET_CHECKPOINT_PATH = 'gs://cloud-tpu-checkpoints/retinanet/resnet50-checkpoi # pylint: enable=line-too-long -class DetectionBenchmarkBase(perfzero_benchmark.PerfZeroBenchmark): +class BenchmarkBase(perfzero_benchmark.PerfZeroBenchmark): """Base class to hold methods common to test classes.""" def __init__(self, **kwargs): - super(DetectionBenchmarkBase, self).__init__(**kwargs) + super(BenchmarkBase, self).__init__(**kwargs) self.timer_callback = None def _report_benchmark(self, stats, start_time_sec, wall_time_sec, min_ap, @@ -99,7 +99,7 @@ class DetectionBenchmarkBase(perfzero_benchmark.PerfZeroBenchmark): extras={'flags': flags_str}) -class RetinanetBenchmarkBase(DetectionBenchmarkBase): +class DetectionBenchmarkBase(BenchmarkBase): """Base class to hold methods common to test classes in the module.""" def __init__(self, **kwargs): @@ -107,7 +107,7 @@ class RetinanetBenchmarkBase(DetectionBenchmarkBase): self.eval_data_path = COCO_EVAL_DATA self.eval_json_path = COCO_EVAL_JSON self.resnet_checkpoint_path = RESNET_CHECKPOINT_PATH - super(RetinanetBenchmarkBase, self).__init__(**kwargs) + super(DetectionBenchmarkBase, self).__init__(**kwargs) def _run_detection_main(self): """Starts detection job.""" @@ -118,7 +118,7 @@ class RetinanetBenchmarkBase(DetectionBenchmarkBase): return detection.run() -class RetinanetAccuracy(RetinanetBenchmarkBase): +class DetectionAccuracy(DetectionBenchmarkBase): """Accuracy test for RetinaNet model. Tests RetinaNet detection task model accuracy. The naming @@ -126,6 +126,10 @@ class RetinanetAccuracy(RetinanetBenchmarkBase): `benchmark_(number of gpus)_gpu_(dataset type)` format. 
""" + def __init__(self, model, **kwargs): + self.model = model + super(DetectionAccuracy, self).__init__(**kwargs) + @benchmark_wrappers.enable_runtime_flags def _run_and_report_benchmark(self, params, @@ -133,7 +137,7 @@ class RetinanetAccuracy(RetinanetBenchmarkBase): max_ap=0.35, do_eval=True, warmup=1): - """Starts RetinaNet accuracy benchmark test.""" + """Starts Detection accuracy benchmark test.""" FLAGS.params_override = json.dumps(params) # Need timer callback to measure performance self.timer_callback = keras_utils.TimeHistory( @@ -156,8 +160,8 @@ class RetinanetAccuracy(RetinanetBenchmarkBase): max_ap, warmup) def _setup(self): - super(RetinanetAccuracy, self)._setup() - FLAGS.model = 'retinanet' + super(DetectionAccuracy, self)._setup() + FLAGS.model = self.model def _params(self): return { @@ -195,22 +199,22 @@ class RetinanetAccuracy(RetinanetBenchmarkBase): self._run_and_report_benchmark(params) -class RetinanetBenchmarkReal(RetinanetAccuracy): - """Short benchmark performance tests for RetinaNet model. +class DetectionBenchmarkReal(DetectionAccuracy): + """Short benchmark performance tests for a detection model. - Tests RetinaNet performance in different GPU configurations. + Tests detection performance in different accelerator configurations. The naming convention of below test cases follow `benchmark_(number of gpus)_gpu` format. """ def _setup(self): - super(RetinanetBenchmarkReal, self)._setup() + super(DetectionBenchmarkReal, self)._setup() # Use negative value to avoid saving checkpoints. 
FLAGS.save_checkpoint_freq = -1 @flagsaver.flagsaver def benchmark_8_gpu_coco(self): - """Run RetinaNet model accuracy test with 8 GPUs.""" + """Run detection model accuracy test with 8 GPUs.""" self._setup() params = self._params() params['architecture']['use_bfloat16'] = False @@ -230,7 +234,7 @@ class RetinanetBenchmarkReal(RetinanetAccuracy): @flagsaver.flagsaver def benchmark_1_gpu_coco(self): - """Run RetinaNet model accuracy test with 1 GPU.""" + """Run detection model accuracy test with 1 GPU.""" self._setup() params = self._params() params['architecture']['use_bfloat16'] = False @@ -245,7 +249,7 @@ class RetinanetBenchmarkReal(RetinanetAccuracy): @flagsaver.flagsaver def benchmark_xla_1_gpu_coco(self): - """Run RetinaNet model accuracy test with 1 GPU and XLA enabled.""" + """Run detection model accuracy test with 1 GPU and XLA enabled.""" self._setup() params = self._params() params['architecture']['use_bfloat16'] = False @@ -261,7 +265,7 @@ class RetinanetBenchmarkReal(RetinanetAccuracy): @flagsaver.flagsaver def benchmark_2x2_tpu_coco(self): - """Run RetinaNet model accuracy test with 4 TPUs.""" + """Run detection model accuracy test with 4 TPUs.""" self._setup() params = self._params() params['train']['batch_size'] = 64 @@ -271,6 +275,88 @@ class RetinanetBenchmarkReal(RetinanetAccuracy): FLAGS.strategy_type = 'tpu' self._run_and_report_benchmark(params, do_eval=False, warmup=0) + @flagsaver.flagsaver + def benchmark_4x4_tpu_coco(self): + """Run detection model accuracy test with 4x4 TPU.""" + self._setup() + params = self._params() + params['train']['batch_size'] = 256 + params['train']['total_steps'] = 469 # One epoch. 
+ params['train']['iterations_per_loop'] = 500 + FLAGS.model_dir = self._get_model_dir('real_benchmark_4x4_tpu_coco') + FLAGS.strategy_type = 'tpu' + self._run_and_report_benchmark(params, do_eval=False, warmup=0) + + @flagsaver.flagsaver + def benchmark_2x2_tpu_coco_mlir(self): + """Run detection model accuracy test with 4 TPUs and MLIR enabled.""" + self._setup() + params = self._params() + params['train']['batch_size'] = 64 + params['train']['total_steps'] = 1875 # One epoch. + params['train']['iterations_per_loop'] = 500 + FLAGS.model_dir = self._get_model_dir('real_benchmark_2x2_tpu_coco_mlir') + FLAGS.strategy_type = 'tpu' + tf.config.experimental.enable_mlir_bridge() + self._run_and_report_benchmark(params, do_eval=False, warmup=0) + + @flagsaver.flagsaver + def benchmark_4x4_tpu_coco_mlir(self): + """Run detection model accuracy test with 4x4 TPU and MLIR enabled.""" + self._setup() + params = self._params() + params['train']['batch_size'] = 256 + params['train']['total_steps'] = 469 # One epoch. + params['train']['iterations_per_loop'] = 500 + FLAGS.model_dir = self._get_model_dir('real_benchmark_4x4_tpu_coco_mlir') + FLAGS.strategy_type = 'tpu' + tf.config.experimental.enable_mlir_bridge() + self._run_and_report_benchmark(params, do_eval=False, warmup=0) + + @flagsaver.flagsaver + def benchmark_2x2_tpu_spinenet_coco(self): + """Run detection model with SpineNet backbone accuracy test with 4 TPUs.""" + self._setup() + params = self._params() + params['architecture']['backbone'] = 'spinenet' + params['architecture']['multilevel_features'] = 'identity' + params['architecture']['use_bfloat16'] = False + params['train']['batch_size'] = 64 + params['train']['total_steps'] = 1875 # One epoch. 
+ params['train']['iterations_per_loop'] = 500 + params['train']['checkpoint']['path'] = '' + FLAGS.model_dir = self._get_model_dir( + 'real_benchmark_2x2_tpu_spinenet_coco') + FLAGS.strategy_type = 'tpu' + self._run_and_report_benchmark(params, do_eval=False, warmup=0) + + +class RetinanetBenchmarkReal(DetectionBenchmarkReal): + """Short benchmark performance tests for Retinanet model.""" + + def __init__(self, **kwargs): + super(RetinanetBenchmarkReal, self).__init__( + model='retinanet', + **kwargs) + + +class MaskRCNNBenchmarkReal(DetectionBenchmarkReal): + """Short benchmark performance tests for Mask RCNN model.""" + + def __init__(self, **kwargs): + super(MaskRCNNBenchmarkReal, self).__init__( + model='mask_rcnn', + **kwargs) + + +class ShapeMaskBenchmarkReal(DetectionBenchmarkReal): + """Short benchmark performance tests for ShapeMask model.""" + + def __init__(self, **kwargs): + super(ShapeMaskBenchmarkReal, self).__init__( + model='shapemask', + **kwargs) + if __name__ == '__main__': tf.test.main() diff --git a/official/benchmark/transformer_benchmark.py b/official/benchmark/transformer_benchmark.py index e61201aa174af4882c6dbab28e10fe64d8cc1377..597b9465c81875ca28c276676146b1aec04c4674 100644 --- a/official/benchmark/transformer_benchmark.py +++ b/official/benchmark/transformer_benchmark.py @@ -29,6 +29,8 @@ from official.nlp.transformer import misc from official.nlp.transformer import transformer_main as transformer_main from official.utils.flags import core as flags_core +TPU_DATA_DIR = 'gs://mlcompass-data/transformer' +GPU_DATA_DIR = os.getenv('TMPDIR') TRANSFORMER_EN2DE_DATA_DIR_NAME = 'wmt32k-en2de-official' EN2DE_2014_BLEU_DATA_DIR_NAME = 'newstest2014' FLAGS = flags.FLAGS @@ -40,37 +42,54 @@ class TransformerBenchmark(PerfZeroBenchmark): Code under test for the Transformer Keras models report the same data and require the same FLAG setup. 
+ """ def __init__(self, output_dir=None, default_flags=None, root_data_dir=None, flag_methods=None, tpu=None): + self._set_data_files(root_data_dir=root_data_dir) + + if default_flags is None: + default_flags = {} + default_flags['data_dir'] = self.train_data_dir + default_flags['vocab_file'] = self.vocab_file + + super(TransformerBenchmark, self).__init__( + output_dir=output_dir, + default_flags=default_flags, + flag_methods=flag_methods, + tpu=tpu) + + def _set_data_files(self, root_data_dir=None, tpu_run=False): + """Sets train_data_dir, vocab_file, bleu_source and bleu_ref.""" + # Use remote storage for TPU, remote storage for GPU if defined, else + # use environment provided root_data_dir. + if tpu_run: + root_data_dir = TPU_DATA_DIR + elif GPU_DATA_DIR is not None: + root_data_dir = GPU_DATA_DIR + root_data_dir = root_data_dir if root_data_dir else '' self.train_data_dir = os.path.join(root_data_dir, TRANSFORMER_EN2DE_DATA_DIR_NAME) - self.vocab_file = os.path.join(root_data_dir, TRANSFORMER_EN2DE_DATA_DIR_NAME, 'vocab.ende.32768') - self.bleu_source = os.path.join(root_data_dir, EN2DE_2014_BLEU_DATA_DIR_NAME, 'newstest2014.en') - self.bleu_ref = os.path.join(root_data_dir, EN2DE_2014_BLEU_DATA_DIR_NAME, 'newstest2014.de') - if default_flags is None: - default_flags = {} - default_flags['data_dir'] = self.train_data_dir - default_flags['vocab_file'] = self.vocab_file - - super(TransformerBenchmark, self).__init__( - output_dir=output_dir, - default_flags=default_flags, - flag_methods=flag_methods, - tpu=tpu) + def _set_data_file_flags(self): + """Sets the FLAGS for the data files.""" + FLAGS.data_dir = self.train_data_dir + FLAGS.vocab_file = self.vocab_file + # Sets values directly to avoid validation check. 
+ FLAGS['bleu_source'].value = self.bleu_source + FLAGS['bleu_ref'].value = self.bleu_ref @benchmark_wrappers.enable_runtime_flags def _run_and_report_benchmark(self, @@ -164,12 +183,8 @@ class TransformerBaseKerasAccuracy(TransformerBenchmark): not converge to the 27.3 BLEU (uncased) SOTA. """ self._setup() + self._set_data_file_flags() FLAGS.num_gpus = 1 - FLAGS.data_dir = self.train_data_dir - FLAGS.vocab_file = self.vocab_file - # Sets values directly to avoid validation check. - FLAGS['bleu_source'].value = self.bleu_source - FLAGS['bleu_ref'].value = self.bleu_ref FLAGS.param_set = 'base' FLAGS.batch_size = 2048 FLAGS.train_steps = 1000 @@ -189,12 +204,8 @@ class TransformerBaseKerasAccuracy(TransformerBenchmark): not converge to the 27.3 BLEU (uncased) SOTA. """ self._setup() + self._set_data_file_flags() FLAGS.num_gpus = 1 - FLAGS.data_dir = self.train_data_dir - FLAGS.vocab_file = self.vocab_file - # Sets values directly to avoid validation check. - FLAGS['bleu_source'].value = self.bleu_source - FLAGS['bleu_ref'].value = self.bleu_ref FLAGS.param_set = 'base' FLAGS.batch_size = 4096 FLAGS.train_steps = 100000 @@ -215,12 +226,8 @@ class TransformerBaseKerasAccuracy(TransformerBenchmark): Should converge to 27.3 BLEU (uncased). This has not been confirmed yet. """ self._setup() + self._set_data_file_flags() FLAGS.num_gpus = 8 - FLAGS.data_dir = self.train_data_dir - FLAGS.vocab_file = self.vocab_file - # Sets values directly to avoid validation check. - FLAGS['bleu_source'].value = self.bleu_source - FLAGS['bleu_ref'].value = self.bleu_ref FLAGS.param_set = 'base' FLAGS.batch_size = 4096*8 FLAGS.train_steps = 100000 @@ -237,12 +244,8 @@ class TransformerBaseKerasAccuracy(TransformerBenchmark): Should converge to 27.3 BLEU (uncased). This has not been confirmed yet. """ self._setup() + self._set_data_file_flags() FLAGS.num_gpus = 8 - FLAGS.data_dir = self.train_data_dir - FLAGS.vocab_file = self.vocab_file - # Sets values directly to avoid validation check. 
- FLAGS['bleu_source'].value = self.bleu_source - FLAGS['bleu_ref'].value = self.bleu_ref FLAGS.param_set = 'base' FLAGS.batch_size = 4096*8 FLAGS.train_steps = 100000 @@ -284,12 +287,8 @@ class TransformerBigKerasAccuracy(TransformerBenchmark): Iterations are not epochs, an iteration is a number of steps between evals. """ self._setup() + self._set_data_file_flags() FLAGS.num_gpus = 8 - FLAGS.data_dir = self.train_data_dir - FLAGS.vocab_file = self.vocab_file - # Sets values directly to avoid validation check. - FLAGS['bleu_source'].value = self.bleu_source - FLAGS['bleu_ref'].value = self.bleu_ref FLAGS.param_set = 'big' FLAGS.batch_size = 3072*8 FLAGS.train_steps = 20000 * 12 @@ -306,12 +305,8 @@ class TransformerBigKerasAccuracy(TransformerBenchmark): Should converge to 28.4 BLEU (uncased). This has not be verified yet." """ self._setup() + self._set_data_file_flags() FLAGS.num_gpus = 8 - FLAGS.data_dir = self.train_data_dir - FLAGS.vocab_file = self.vocab_file - # Sets values directly to avoid validation check. - FLAGS['bleu_source'].value = self.bleu_source - FLAGS['bleu_ref'].value = self.bleu_ref FLAGS.param_set = 'big' FLAGS.batch_size = 3072*8 FLAGS.static_batch = True @@ -337,13 +332,9 @@ class TransformerBigKerasAccuracy(TransformerBenchmark): not epochs, an iteration is a number of steps between evals. """ self._setup() + self._set_data_file_flags() FLAGS.num_gpus = 8 FLAGS.dtype = 'fp16' - FLAGS.data_dir = self.train_data_dir - FLAGS.vocab_file = self.vocab_file - # Sets values directly to avoid validation check. - FLAGS['bleu_source'].value = self.bleu_source - FLAGS['bleu_ref'].value = self.bleu_ref FLAGS.param_set = 'big' FLAGS.batch_size = 3072*8 FLAGS.train_steps = 20000 * 12 @@ -360,14 +351,10 @@ class TransformerBigKerasAccuracy(TransformerBenchmark): Should converge to 28.4 BLEU (uncased). This has not be verified yet." 
""" self._setup() + self._set_data_file_flags() FLAGS.num_gpus = 8 FLAGS.dtype = 'fp16' FLAGS.fp16_implementation = 'graph_rewrite' - FLAGS.data_dir = self.train_data_dir - FLAGS.vocab_file = self.vocab_file - # Sets values directly to avoid validation check. - FLAGS['bleu_source'].value = self.bleu_source - FLAGS['bleu_ref'].value = self.bleu_ref FLAGS.param_set = 'big' FLAGS.batch_size = 3072*8 FLAGS.train_steps = 20000 * 12 @@ -384,13 +371,9 @@ class TransformerBigKerasAccuracy(TransformerBenchmark): Should converge to 28.4 BLEU (uncased). This has not be verified yet." """ self._setup() + self._set_data_file_flags() FLAGS.num_gpus = 8 FLAGS.dtype = 'fp16' - FLAGS.data_dir = self.train_data_dir - FLAGS.vocab_file = self.vocab_file - # Sets values directly to avoid validation check. - FLAGS['bleu_source'].value = self.bleu_source - FLAGS['bleu_ref'].value = self.bleu_ref FLAGS.param_set = 'big' FLAGS.batch_size = 3072*8 FLAGS.static_batch = True @@ -409,14 +392,10 @@ class TransformerBigKerasAccuracy(TransformerBenchmark): Should converge to 28.4 BLEU (uncased). This has not be verified yet." """ self._setup() + self._set_data_file_flags() FLAGS.num_gpus = 8 FLAGS.dtype = 'fp16' FLAGS.enable_xla = True - FLAGS.data_dir = self.train_data_dir - FLAGS.vocab_file = self.vocab_file - # Sets values directly to avoid validation check. 
- FLAGS['bleu_source'].value = self.bleu_source - FLAGS['bleu_ref'].value = self.bleu_ref FLAGS.param_set = 'big' FLAGS.batch_size = 3072*8 FLAGS.static_batch = True @@ -687,22 +666,41 @@ class TransformerBigKerasBenchmarkReal(TransformerKerasBenchmark): root_data_dir=root_data_dir, batch_per_gpu=3072, tpu=tpu) - def benchmark_2x2_tpu(self): - """Port of former snaggletooth transformer_big model on 2x2.""" - self._setup() - FLAGS.model_dir = self._get_model_dir('benchmark_2x2_tpu') + def _set_df_common(self): + self._set_data_files(tpu_run=True) + FLAGS.data_dir = self.train_data_dir + FLAGS.vocab_file = self.vocab_file + FLAGS.distribution_strategy = 'tpu' + FLAGS.padded_decode = True FLAGS.train_steps = 300 FLAGS.log_steps = 150 FLAGS.steps_between_evals = 150 - FLAGS.distribution_strategy = 'tpu' FLAGS.static_batch = True FLAGS.use_ctl = True - FLAGS.batch_size = 6144 + FLAGS.enable_checkpointing = False FLAGS.max_length = 64 FLAGS.decode_batch_size = 32 FLAGS.decode_max_length = 97 - FLAGS.padded_decode = True - FLAGS.enable_checkpointing = False + + def benchmark_2x2_tpu(self): + """Port of former snaggletooth transformer_big model on 2x2.""" + self._setup() + self._set_df_common() + FLAGS.model_dir = self._get_model_dir('benchmark_2x2_tpu') + FLAGS.batch_size = 6144 + + self._run_and_report_benchmark( + total_batch_size=FLAGS.batch_size, + log_steps=FLAGS.log_steps) + + @owner_utils.Owner('tf-graph-compiler') + def benchmark_2x2_tpu_mlir(self): + """Run transformer_big model on 2x2 with the MLIR Bridge enabled.""" + self._setup() + self._set_df_common() + FLAGS.model_dir = self._get_model_dir('benchmark_2x2_tpu_mlir') + FLAGS.batch_size = 6144 + tf.config.experimental.enable_mlir_bridge() self._run_and_report_benchmark( total_batch_size=FLAGS.batch_size, @@ -711,19 +709,9 @@ class TransformerBigKerasBenchmarkReal(TransformerKerasBenchmark): def benchmark_4x4_tpu(self): """Port of former GCP transformer_big model on 4x4.""" self._setup() + 
self._set_df_common() FLAGS.model_dir = self._get_model_dir('benchmark_4x4_tpu') - FLAGS.train_steps = 300 - FLAGS.log_steps = 150 - FLAGS.steps_between_evals = 150 - FLAGS.distribution_strategy = 'tpu' - FLAGS.static_batch = True - FLAGS.use_ctl = True FLAGS.batch_size = 24576 - FLAGS.max_length = 64 - FLAGS.decode_batch_size = 32 - FLAGS.decode_max_length = 97 - FLAGS.padded_decode = True - FLAGS.enable_checkpointing = False self._run_and_report_benchmark( total_batch_size=FLAGS.batch_size, @@ -733,19 +721,9 @@ class TransformerBigKerasBenchmarkReal(TransformerKerasBenchmark): def benchmark_4x4_tpu_mlir(self): """Run transformer_big model on 4x4 with the MLIR Bridge enabled.""" self._setup() - FLAGS.model_dir = self._get_model_dir('benchmark_4x4_tpu') - FLAGS.train_steps = 300 - FLAGS.log_steps = 150 - FLAGS.steps_between_evals = 150 - FLAGS.distribution_strategy = 'tpu' - FLAGS.static_batch = True - FLAGS.use_ctl = True + self._set_df_common() + FLAGS.model_dir = self._get_model_dir('benchmark_4x4_tpu_mlir') FLAGS.batch_size = 24576 - FLAGS.max_length = 64 - FLAGS.decode_batch_size = 32 - FLAGS.decode_max_length = 97 - FLAGS.padded_decode = True - FLAGS.enable_checkpointing = False tf.config.experimental.enable_mlir_bridge() self._run_and_report_benchmark( diff --git a/official/benchmark/unet3d_benchmark.py b/official/benchmark/unet3d_benchmark.py index 2614b29259dcf4c85d609abca94706c95570b7ec..8c811e483e4d1935487f1175baf6f5786632c952 100644 --- a/official/benchmark/unet3d_benchmark.py +++ b/official/benchmark/unet3d_benchmark.py @@ -93,8 +93,11 @@ class Unet3DAccuracyBenchmark(keras_benchmark.KerasBenchmark): """Runs and reports the benchmark given the provided configuration.""" params = unet_training_lib.extract_params(FLAGS) strategy = unet_training_lib.create_distribution_strategy(params) - if params.use_bfloat16: - policy = tf.keras.mixed_precision.experimental.Policy('mixed_bfloat16') + + input_dtype = params.dtype + if input_dtype == 'float16' or 
input_dtype == 'bfloat16': + policy = tf.keras.mixed_precision.experimental.Policy( + 'mixed_bfloat16' if input_dtype == 'bfloat16' else 'mixed_float16') tf.keras.mixed_precision.experimental.set_policy(policy) stats = {} diff --git a/official/colab/fine_tuning_bert.ipynb b/official/colab/fine_tuning_bert.ipynb index 443674b6b9f1292d25f26cc06e3359506763bfce..b63c9a3f6d7912c61eee0a948406c8934061e88c 100644 --- a/official/colab/fine_tuning_bert.ipynb +++ b/official/colab/fine_tuning_bert.ipynb @@ -12,7 +12,7 @@ }, { "cell_type": "code", - "execution_count": 0, + "execution_count": null, "metadata": { "cellView": "form", "colab": {}, @@ -104,7 +104,7 @@ }, { "cell_type": "code", - "execution_count": 0, + "execution_count": null, "metadata": { "colab": {}, "colab_type": "code", @@ -128,7 +128,7 @@ }, { "cell_type": "code", - "execution_count": 0, + "execution_count": null, "metadata": { "colab": {}, "colab_type": "code", @@ -185,7 +185,7 @@ }, { "cell_type": "code", - "execution_count": 0, + "execution_count": null, "metadata": { "colab": {}, "colab_type": "code", @@ -204,12 +204,12 @@ "id": "9uFskufsR2LT" }, "source": [ - "You can get a pre-trained BERT encoder from TensorFlow Hub here:" + "You can get a pre-trained BERT encoder from [TensorFlow Hub](https://tfhub.dev/tensorflow/bert_en_uncased_L-12_H-768_A-12/2):" ] }, { "cell_type": "code", - "execution_count": 0, + "execution_count": null, "metadata": { "colab": {}, "colab_type": "code", @@ -252,7 +252,7 @@ }, { "cell_type": "code", - "execution_count": 0, + "execution_count": null, "metadata": { "colab": {}, "colab_type": "code", @@ -267,7 +267,7 @@ }, { "cell_type": "code", - "execution_count": 0, + "execution_count": null, "metadata": { "colab": {}, "colab_type": "code", @@ -290,7 +290,7 @@ }, { "cell_type": "code", - "execution_count": 0, + "execution_count": null, "metadata": { "colab": {}, "colab_type": "code", @@ -313,7 +313,7 @@ }, { "cell_type": "code", - "execution_count": 0, + "execution_count": null, 
"metadata": { "colab": {}, "colab_type": "code", @@ -336,7 +336,7 @@ }, { "cell_type": "code", - "execution_count": 0, + "execution_count": null, "metadata": { "colab": {}, "colab_type": "code", @@ -376,7 +376,7 @@ }, { "cell_type": "code", - "execution_count": 0, + "execution_count": null, "metadata": { "colab": {}, "colab_type": "code", @@ -404,7 +404,7 @@ }, { "cell_type": "code", - "execution_count": 0, + "execution_count": null, "metadata": { "colab": {}, "colab_type": "code", @@ -446,7 +446,7 @@ }, { "cell_type": "code", - "execution_count": 0, + "execution_count": null, "metadata": { "colab": {}, "colab_type": "code", @@ -469,7 +469,7 @@ }, { "cell_type": "code", - "execution_count": 0, + "execution_count": null, "metadata": { "colab": {}, "colab_type": "code", @@ -490,7 +490,7 @@ }, { "cell_type": "code", - "execution_count": 0, + "execution_count": null, "metadata": { "colab": {}, "colab_type": "code", @@ -514,7 +514,7 @@ }, { "cell_type": "code", - "execution_count": 0, + "execution_count": null, "metadata": { "colab": {}, "colab_type": "code", @@ -562,7 +562,7 @@ }, { "cell_type": "code", - "execution_count": 0, + "execution_count": null, "metadata": { "colab": {}, "colab_type": "code", @@ -587,7 +587,7 @@ }, { "cell_type": "code", - "execution_count": 0, + "execution_count": null, "metadata": { "colab": {}, "colab_type": "code", @@ -617,7 +617,7 @@ }, { "cell_type": "code", - "execution_count": 0, + "execution_count": null, "metadata": { "colab": {}, "colab_type": "code", @@ -661,7 +661,7 @@ }, { "cell_type": "code", - "execution_count": 0, + "execution_count": null, "metadata": { "colab": {}, "colab_type": "code", @@ -691,7 +691,7 @@ }, { "cell_type": "code", - "execution_count": 0, + "execution_count": null, "metadata": { "colab": {}, "colab_type": "code", @@ -737,7 +737,7 @@ }, { "cell_type": "code", - "execution_count": 0, + "execution_count": null, "metadata": { "colab": {}, "colab_type": "code", @@ -769,7 +769,7 @@ }, { "cell_type": "code", - 
"execution_count": 0, + "execution_count": null, "metadata": { "colab": {}, "colab_type": "code", @@ -793,7 +793,7 @@ }, { "cell_type": "code", - "execution_count": 0, + "execution_count": null, "metadata": { "colab": {}, "colab_type": "code", @@ -816,7 +816,7 @@ }, { "cell_type": "code", - "execution_count": 0, + "execution_count": null, "metadata": { "colab": {}, "colab_type": "code", @@ -845,7 +845,7 @@ }, { "cell_type": "code", - "execution_count": 0, + "execution_count": null, "metadata": { "colab": {}, "colab_type": "code", @@ -870,7 +870,7 @@ }, { "cell_type": "code", - "execution_count": 0, + "execution_count": null, "metadata": { "colab": {}, "colab_type": "code", @@ -908,7 +908,7 @@ }, { "cell_type": "code", - "execution_count": 0, + "execution_count": null, "metadata": { "colab": {}, "colab_type": "code", @@ -943,7 +943,7 @@ }, { "cell_type": "code", - "execution_count": 0, + "execution_count": null, "metadata": { "colab": {}, "colab_type": "code", @@ -986,7 +986,7 @@ }, { "cell_type": "code", - "execution_count": 0, + "execution_count": null, "metadata": { "colab": {}, "colab_type": "code", @@ -1023,7 +1023,7 @@ }, { "cell_type": "code", - "execution_count": 0, + "execution_count": null, "metadata": { "colab": {}, "colab_type": "code", @@ -1055,7 +1055,7 @@ }, { "cell_type": "code", - "execution_count": 0, + "execution_count": null, "metadata": { "colab": {}, "colab_type": "code", @@ -1071,7 +1071,7 @@ }, { "cell_type": "code", - "execution_count": 0, + "execution_count": null, "metadata": { "colab": {}, "colab_type": "code", @@ -1096,7 +1096,7 @@ }, { "cell_type": "code", - "execution_count": 0, + "execution_count": null, "metadata": { "colab": {}, "colab_type": "code", @@ -1110,7 +1110,7 @@ }, { "cell_type": "code", - "execution_count": 0, + "execution_count": null, "metadata": { "colab": {}, "colab_type": "code", @@ -1176,7 +1176,7 @@ }, { "cell_type": "code", - "execution_count": 0, + "execution_count": null, "metadata": { "colab": {}, "colab_type": 
"code", @@ -1201,7 +1201,7 @@ }, { "cell_type": "code", - "execution_count": 0, + "execution_count": null, "metadata": { "colab": {}, "colab_type": "code", @@ -1240,7 +1240,7 @@ }, { "cell_type": "code", - "execution_count": 0, + "execution_count": null, "metadata": { "colab": {}, "colab_type": "code", @@ -1273,7 +1273,7 @@ }, { "cell_type": "code", - "execution_count": 0, + "execution_count": null, "metadata": { "colab": {}, "colab_type": "code", @@ -1306,7 +1306,7 @@ }, { "cell_type": "code", - "execution_count": 0, + "execution_count": null, "metadata": { "colab": {}, "colab_type": "code", @@ -1351,7 +1351,7 @@ }, { "cell_type": "code", - "execution_count": 0, + "execution_count": null, "metadata": { "colab": {}, "colab_type": "code", @@ -1379,7 +1379,7 @@ }, { "cell_type": "code", - "execution_count": 0, + "execution_count": null, "metadata": { "colab": {}, "colab_type": "code", @@ -1406,17 +1406,44 @@ }, { "cell_type": "code", - "execution_count": 0, + "execution_count": null, "metadata": { "colab": {}, "colab_type": "code", - "id": "lo6479At4sP1" + "id": "GDWrHm0BGpbX" }, "outputs": [], "source": [ "# Note: 350MB download.\n", - "import tensorflow_hub as hub\n", - "hub_encoder = hub.KerasLayer(hub_url_bert, trainable=True)\n", + "import tensorflow_hub as hub" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "cellView": "form", + "colab": {}, + "colab_type": "code", + "id": "Y29meH0qGq_5" + }, + "outputs": [], + "source": [ + "hub_model_name = \"bert_en_uncased_L-12_H-768_A-12\" #@param [\"bert_en_uncased_L-24_H-1024_A-16\", \"bert_en_wwm_cased_L-24_H-1024_A-16\", \"bert_en_uncased_L-12_H-768_A-12\", \"bert_en_wwm_uncased_L-24_H-1024_A-16\", \"bert_en_cased_L-24_H-1024_A-16\", \"bert_en_cased_L-12_H-768_A-12\", \"bert_zh_L-12_H-768_A-12\", \"bert_multi_cased_L-12_H-768_A-12\"]" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "colab": {}, + "colab_type": "code", + "id": "lo6479At4sP1" + }, + 
"outputs": [], + "source": [ + "hub_encoder = hub.KerasLayer(f\"https://tfhub.dev/tensorflow/{hub_model_name}\",\n", + " trainable=True)\n", "\n", "print(f\"The Hub encoder has {len(hub_encoder.trainable_variables)} trainable variables\")" ] @@ -1433,7 +1460,7 @@ }, { "cell_type": "code", - "execution_count": 0, + "execution_count": null, "metadata": { "colab": {}, "colab_type": "code", @@ -1466,7 +1493,7 @@ }, { "cell_type": "code", - "execution_count": 0, + "execution_count": null, "metadata": { "colab": {}, "colab_type": "code", @@ -1491,7 +1518,7 @@ }, { "cell_type": "code", - "execution_count": 0, + "execution_count": null, "metadata": { "colab": {}, "colab_type": "code", @@ -1504,7 +1531,7 @@ }, { "cell_type": "code", - "execution_count": 0, + "execution_count": null, "metadata": { "colab": {}, "colab_type": "code", @@ -1545,7 +1572,7 @@ }, { "cell_type": "code", - "execution_count": 0, + "execution_count": null, "metadata": { "colab": {}, "colab_type": "code", @@ -1569,7 +1596,7 @@ }, { "cell_type": "code", - "execution_count": 0, + "execution_count": null, "metadata": { "colab": {}, "colab_type": "code", @@ -1592,7 +1619,7 @@ }, { "cell_type": "code", - "execution_count": 0, + "execution_count": null, "metadata": { "colab": {}, "colab_type": "code", @@ -1617,7 +1644,7 @@ }, { "cell_type": "code", - "execution_count": 0, + "execution_count": null, "metadata": { "colab": {}, "colab_type": "code", @@ -1643,7 +1670,7 @@ }, { "cell_type": "code", - "execution_count": 0, + "execution_count": null, "metadata": { "colab": {}, "colab_type": "code", @@ -1661,7 +1688,7 @@ }, { "cell_type": "code", - "execution_count": 0, + "execution_count": null, "metadata": { "colab": {}, "colab_type": "code", @@ -1688,7 +1715,7 @@ }, { "cell_type": "code", - "execution_count": 0, + "execution_count": null, "metadata": { "colab": {}, "colab_type": "code", @@ -1714,7 +1741,7 @@ }, { "cell_type": "code", - "execution_count": 0, + "execution_count": null, "metadata": { "colab": {}, 
"colab_type": "code", @@ -1733,7 +1760,7 @@ }, { "cell_type": "code", - "execution_count": 0, + "execution_count": null, "metadata": { "colab": {}, "colab_type": "code", @@ -1761,7 +1788,7 @@ }, { "cell_type": "code", - "execution_count": 0, + "execution_count": null, "metadata": { "colab": {}, "colab_type": "code", @@ -1795,7 +1822,7 @@ }, { "cell_type": "code", - "execution_count": 0, + "execution_count": null, "metadata": { "colab": {}, "colab_type": "code", diff --git a/official/colab/nlp/customize_encoder.ipynb b/official/colab/nlp/customize_encoder.ipynb new file mode 100644 index 0000000000000000000000000000000000000000..18b45d3a66fcaab007d25c1d6db1cd461509daa2 --- /dev/null +++ b/official/colab/nlp/customize_encoder.ipynb @@ -0,0 +1,625 @@ +{ + "cells": [ + { + "cell_type": "markdown", + "metadata": { + "colab_type": "text", + "id": "Bp8t2AI8i7uP" + }, + "source": [ + "##### Copyright 2020 The TensorFlow Authors." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "cellView": "form", + "colab": {}, + "colab_type": "code", + "id": "rxPj2Lsni9O4" + }, + "outputs": [], + "source": [ + "#@title Licensed under the Apache License, Version 2.0 (the \"License\");\n", + "# you may not use this file except in compliance with the License.\n", + "# You may obtain a copy of the License at\n", + "#\n", + "# https://www.apache.org/licenses/LICENSE-2.0\n", + "#\n", + "# Unless required by applicable law or agreed to in writing, software\n", + "# distributed under the License is distributed on an \"AS IS\" BASIS,\n", + "# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.\n", + "# See the License for the specific language governing permissions and\n", + "# limitations under the License." 
+ ] + }, + { + "cell_type": "markdown", + "metadata": { + "colab_type": "text", + "id": "6xS-9i5DrRvO" + }, + "source": [ + "# Customizing a Transformer Encoder" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "colab_type": "text", + "id": "Mwb9uw1cDXsa" + }, + "source": [ + "\u003ctable class=\"tfo-notebook-buttons\" align=\"left\"\u003e\n", + " \u003ctd\u003e\n", + " \u003ca target=\"_blank\" href=\"https://www.tensorflow.org/official_models/nlp/customize_encoder\"\u003e\u003cimg src=\"https://www.tensorflow.org/images/tf_logo_32px.png\" /\u003eView on TensorFlow.org\u003c/a\u003e\n", + " \u003c/td\u003e\n", + " \u003ctd\u003e\n", + " \u003ca target=\"_blank\" href=\"https://colab.research.google.com/github/tensorflow/models/blob/master/official/colab/nlp/customize_encoder.ipynb\"\u003e\u003cimg src=\"https://www.tensorflow.org/images/colab_logo_32px.png\" /\u003eRun in Google Colab\u003c/a\u003e\n", + " \u003c/td\u003e\n", + " \u003ctd\u003e\n", + " \u003ca target=\"_blank\" href=\"https://github.com/tensorflow/models/blob/master/official/colab/nlp/customize_encoder.ipynb\"\u003e\u003cimg src=\"https://www.tensorflow.org/images/GitHub-Mark-32px.png\" /\u003eView source on GitHub\u003c/a\u003e\n", + " \u003c/td\u003e\n", + " \u003ctd\u003e\n", + " \u003ca href=\"https://storage.googleapis.com/tensorflow_docs/models/official/colab/nlp/customize_encoder.ipynb\"\u003e\u003cimg src=\"https://www.tensorflow.org/images/download_logo_32px.png\" /\u003eDownload notebook\u003c/a\u003e\n", + " \u003c/td\u003e\n", + "\u003c/table\u003e" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "colab_type": "text", + "id": "iLrcV4IyrcGX" + }, + "source": [ + "## Learning objectives\n", + "\n", + "The [TensorFlow Models NLP library](https://github.com/tensorflow/models/tree/master/official/nlp/modeling) is a collection of tools for building and training modern high performance natural language models.\n", + "\n", + "The 
[TransformerEncoder](https://github.com/tensorflow/models/blob/master/official/nlp/modeling/networks/encoder_scaffold.py) is the core of this library, and many new network architectures have been proposed to improve the encoder. In this Colab notebook, we will learn how to customize the encoder to employ new network architectures." + ] + }, + { + "cell_type": "markdown", + "metadata": { + "colab_type": "text", + "id": "YYxdyoWgsl8t" + }, + "source": [ + "## Install and import" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "colab_type": "text", + "id": "fEJSFutUsn_h" + }, + "source": [ + "### Install the TensorFlow Model Garden pip package\n", + "\n", + "* `tf-models-nightly` is the nightly Model Garden package created daily automatically.\n", + "* `pip` will install all models and dependencies automatically." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "colab": {}, + "colab_type": "code", + "id": "thsKZDjhswhR" + }, + "outputs": [], + "source": [ + "!pip install -q tf-nightly\n", + "!pip install -q tf-models-nightly" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "colab_type": "text", + "id": "hpf7JPCVsqtv" + }, + "source": [ + "### Import TensorFlow and other libraries" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "colab": {}, + "colab_type": "code", + "id": "my4dp-RMssQe" + }, + "outputs": [], + "source": [ + "import numpy as np\n", + "import tensorflow as tf\n", + "\n", + "from official.modeling import activations\n", + "from official.nlp import modeling\n", + "from official.nlp.modeling import layers, losses, models, networks" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "colab_type": "text", + "id": "vjDmVsFfs85n" + }, + "source": [ + "## Canonical BERT encoder\n", + "\n", + "Before learning how to customize the encoder, let's first create a canonical BERT encoder and use it to instantiate a `BertClassifier` for a classification task."
+ ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "colab": {}, + "colab_type": "code", + "id": "Oav8sbgstWc-" + }, + "outputs": [], + "source": [ + "cfg = {\n", + " \"vocab_size\": 100,\n", + " \"hidden_size\": 32,\n", + " \"num_layers\": 3,\n", + " \"num_attention_heads\": 4,\n", + " \"intermediate_size\": 64,\n", + " \"activation\": activations.gelu,\n", + " \"dropout_rate\": 0.1,\n", + " \"attention_dropout_rate\": 0.1,\n", + " \"sequence_length\": 16,\n", + " \"type_vocab_size\": 2,\n", + " \"initializer\": tf.keras.initializers.TruncatedNormal(stddev=0.02),\n", + "}\n", + "bert_encoder = modeling.networks.TransformerEncoder(**cfg)\n", + "\n", + "def build_classifier(bert_encoder):\n", + " return modeling.models.BertClassifier(bert_encoder, num_classes=2)\n", + "\n", + "canonical_classifier_model = build_classifier(bert_encoder)" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "colab_type": "text", + "id": "Qe2UWI6_tsHo" + }, + "source": [ + "`canonical_classifier_model` can be trained using the training data. For details about how to train the model, please see the colab [fine_tuning_bert.ipynb](https://github.com/tensorflow/models/blob/master/official/colab/fine_tuning_bert.ipynb). 
We skip the code that trains the model here.\n", + "\n", + "After training, we can use the model to make predictions.\n" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "colab": {}, + "colab_type": "code", + "id": "csED2d-Yt5h6" + }, + "outputs": [], + "source": [ + "def predict(model):\n", + " batch_size = 3\n", + " np.random.seed(0)\n", + " word_ids = np.random.randint(\n", + " cfg[\"vocab_size\"], size=(batch_size, cfg[\"sequence_length\"]))\n", + " mask = np.random.randint(2, size=(batch_size, cfg[\"sequence_length\"]))\n", + " type_ids = np.random.randint(\n", + " cfg[\"type_vocab_size\"], size=(batch_size, cfg[\"sequence_length\"]))\n", + " print(model([word_ids, mask, type_ids], training=False))\n", + "\n", + "predict(canonical_classifier_model)" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "colab_type": "text", + "id": "PzKStEK9t_Pb" + }, + "source": [ + "## Customize BERT encoder\n", + "\n", + "A BERT encoder consists of an embedding network and multiple transformer blocks, and each transformer block contains an attention layer and a feedforward layer." + ] + }, + { + "cell_type": "markdown", + "metadata": { + "colab_type": "text", + "id": "rmwQfhj6fmKz" + }, + "source": [ + "We provide easy ways to customize each of those components via (1)\n", + "[EncoderScaffold](https://github.com/tensorflow/models/blob/master/official/nlp/modeling/networks/encoder_scaffold.py) and (2) [TransformerScaffold](https://github.com/tensorflow/models/blob/master/official/nlp/modeling/layers/transformer_scaffold.py)." + ] + }, + { + "cell_type": "markdown", + "metadata": { + "colab_type": "text", + "id": "xsMgEVHAui11" + }, + "source": [ + "### Use EncoderScaffold\n", + "\n", + "`EncoderScaffold` allows users to provide a custom embedding subnetwork\n", + " (which will replace the standard embedding logic) and/or a custom hidden layer class (which will replace the `Transformer` instantiation in the encoder)."
+ ] + }, + { + "cell_type": "markdown", + "metadata": { + "colab_type": "text", + "id": "-JBabpa2AOz8" + }, + "source": [ + "#### Without Customization\n", + "\n", + "Without any customization, `EncoderScaffold` behaves the same as the canonical `TransformerEncoder`.\n", + "\n", + "As shown in the following example, `EncoderScaffold` can load `TransformerEncoder`'s weights and output the same values:" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "colab": {}, + "colab_type": "code", + "id": "ktNzKuVByZQf" + }, + "outputs": [], + "source": [ + "default_hidden_cfg = dict(\n", + " num_attention_heads=cfg[\"num_attention_heads\"],\n", + " intermediate_size=cfg[\"intermediate_size\"],\n", + " intermediate_activation=activations.gelu,\n", + " dropout_rate=cfg[\"dropout_rate\"],\n", + " attention_dropout_rate=cfg[\"attention_dropout_rate\"],\n", + " kernel_initializer=tf.keras.initializers.TruncatedNormal(0.02),\n", + ")\n", + "default_embedding_cfg = dict(\n", + " vocab_size=cfg[\"vocab_size\"],\n", + " type_vocab_size=cfg[\"type_vocab_size\"],\n", + " hidden_size=cfg[\"hidden_size\"],\n", + " seq_length=cfg[\"sequence_length\"],\n", + " initializer=tf.keras.initializers.TruncatedNormal(0.02),\n", + " dropout_rate=cfg[\"dropout_rate\"],\n", + " max_seq_length=cfg[\"sequence_length\"],\n", + ")\n", + "default_kwargs = dict(\n", + " hidden_cfg=default_hidden_cfg,\n", + " embedding_cfg=default_embedding_cfg,\n", + " num_hidden_instances=cfg[\"num_layers\"],\n", + " pooled_output_dim=cfg[\"hidden_size\"],\n", + " return_all_layer_outputs=True,\n", + " pooler_layer_initializer=tf.keras.initializers.TruncatedNormal(0.02),\n", + ")\n", + "encoder_scaffold = modeling.networks.EncoderScaffold(**default_kwargs)\n", + "classifier_model_from_encoder_scaffold = build_classifier(encoder_scaffold)\n", + "classifier_model_from_encoder_scaffold.set_weights(\n", + " canonical_classifier_model.get_weights())\n", +
"predict(classifier_model_from_encoder_scaffold)" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "colab_type": "text", + "id": "sMaUmLyIuwcs" + }, + "source": [ + "#### Customize Embedding\n", + "\n", + "Next, we show how to use a customized embedding network.\n", + "\n", + "We firstly build an embedding network that will replace the default network. This one will have 2 inputs (`mask` and `word_ids`) instead of 3, and won't use positional embeddings." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "colab": {}, + "colab_type": "code", + "id": "LTinnaG6vcsw" + }, + "outputs": [], + "source": [ + "word_ids = tf.keras.layers.Input(\n", + " shape=(cfg['sequence_length'],), dtype=tf.int32, name=\"input_word_ids\")\n", + "mask = tf.keras.layers.Input(\n", + " shape=(cfg['sequence_length'],), dtype=tf.int32, name=\"input_mask\")\n", + "embedding_layer = modeling.layers.OnDeviceEmbedding(\n", + " vocab_size=cfg['vocab_size'],\n", + " embedding_width=cfg['hidden_size'],\n", + " initializer=tf.keras.initializers.TruncatedNormal(stddev=0.02),\n", + " name=\"word_embeddings\")\n", + "word_embeddings = embedding_layer(word_ids)\n", + "attention_mask = layers.SelfAttentionMask()([word_embeddings, mask])\n", + "new_embedding_network = tf.keras.Model([word_ids, mask],\n", + " [word_embeddings, attention_mask])" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "colab_type": "text", + "id": "HN7_yu-6O3qI" + }, + "source": [ + "Inspecting `new_embedding_network`, we can see it takes two inputs:\n", + "`input_word_ids` and `input_mask`." 
+ ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "colab": {}, + "colab_type": "code", + "id": "fO9zKFE4OpHp" + }, + "outputs": [], + "source": [ + "tf.keras.utils.plot_model(new_embedding_network, show_shapes=True, dpi=48)" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "colab_type": "text", + "id": "9cOaGQHLv12W" + }, + "source": [ + "We then can build a new encoder using the above `new_embedding_network`." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "colab": {}, + "colab_type": "code", + "id": "mtFDMNf2vIl9" + }, + "outputs": [], + "source": [ + "kwargs = dict(default_kwargs)\n", + "\n", + "# Use new embedding network.\n", + "kwargs['embedding_cls'] = new_embedding_network\n", + "kwargs['embedding_data'] = embedding_layer.embeddings\n", + "\n", + "encoder_with_customized_embedding = modeling.networks.EncoderScaffold(**kwargs)\n", + "classifier_model = build_classifier(encoder_with_customized_embedding)\n", + "# ... 
Train the model ...\n", + "print(classifier_model.inputs)\n", + "\n", + "# Assert that there are only two inputs.\n", + "assert len(classifier_model.inputs) == 2" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "colab_type": "text", + "id": "Z73ZQDtmwg9K" + }, + "source": [ + "#### Customize Transformer\n", + "\n", + "Users can also override the [hidden_cls](https://github.com/tensorflow/models/blob/master/official/nlp/modeling/networks/encoder_scaffold.py#L103) argument in `EncoderScaffold`'s constructor to employ a customized Transformer layer.\n", + "\n", + "See [ReZeroTransformer](https://github.com/tensorflow/models/blob/master/official/nlp/modeling/layers/rezero_transformer.py) for how to implement a customized Transformer layer.\n", + "\n", + "The following is an example of using `ReZeroTransformer`:\n" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "colab": {}, + "colab_type": "code", + "id": "uAIarLZgw6pA" + }, + "outputs": [], + "source": [ + "kwargs = dict(default_kwargs)\n", + "\n", + "# Use ReZeroTransformer.\n", + "kwargs['hidden_cls'] = modeling.layers.ReZeroTransformer\n", + "\n", + "encoder_with_rezero_transformer = modeling.networks.EncoderScaffold(**kwargs)\n", + "classifier_model = build_classifier(encoder_with_rezero_transformer)\n", + "# ... 
Train the model ...\n", + "predict(classifier_model)\n", + "\n", + "# Assert that the variable `rezero_alpha` from ReZeroTransformer exists.\n", + "assert 'rezero_alpha' in ''.join([x.name for x in classifier_model.trainable_weights])" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "colab_type": "text", + "id": "6PMHFdvnxvR0" + }, + "source": [ + "### Use [TransformerScaffold](https://github.com/tensorflow/models/blob/master/official/nlp/modeling/layers/transformer_scaffold.py)\n", + "\n", + "The above method of customizing `Transformer` requires rewriting the whole `Transformer` layer, while sometimes you may only want to customize either the attention layer or the feedforward block. In this case, [TransformerScaffold](https://github.com/tensorflow/models/blob/master/official/nlp/modeling/layers/transformer_scaffold.py) can be used.\n", + "\n" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "colab_type": "text", + "id": "D6FejlgwyAy_" + }, + "source": [ + "#### Customize Attention Layer\n", + "\n", + "Users can also override the [attention_cls](https://github.com/tensorflow/models/blob/master/official/nlp/modeling/layers/transformer_scaffold.py#L45) argument in `TransformerScaffold`'s constructor to employ a customized Attention layer.\n", + "\n", + "See [TalkingHeadsAttention](https://github.com/tensorflow/models/blob/master/official/nlp/modeling/layers/talking_heads_attention.py) for how to implement a customized `Attention` layer.\n", + "\n", + "The following is an example of using [TalkingHeadsAttention](https://github.com/tensorflow/models/blob/master/official/nlp/modeling/layers/talking_heads_attention.py):" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "colab": {}, + "colab_type": "code", + "id": "nFrSMrZuyNeQ" + }, + "outputs": [], + "source": [ + "# Use TalkingHeadsAttention\n", + "hidden_cfg = dict(default_hidden_cfg)\n", + "hidden_cfg['attention_cls'] = modeling.layers.TalkingHeadsAttention\n", + "\n", + 
"kwargs = dict(default_kwargs)\n", + "kwargs['hidden_cls'] = modeling.layers.TransformerScaffold\n", + "kwargs['hidden_cfg'] = hidden_cfg\n", + "\n", + "encoder = modeling.networks.EncoderScaffold(**kwargs)\n", + "classifier_model = build_classifier(encoder)\n", + "# ... Train the model ...\n", + "predict(classifier_model)\n", + "\n", + "# Assert that the variable `pre_softmax_weight` from TalkingHeadsAttention exists.\n", + "assert 'pre_softmax_weight' in ''.join([x.name for x in classifier_model.trainable_weights])" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "colab_type": "text", + "id": "kuEJcTyByVvI" + }, + "source": [ + "#### Customize Feedforward Layer\n", + "\n", + "Similarly, one could also customize the feedforward layer.\n", + "\n", + "See [GatedFeedforward](https://github.com/tensorflow/models/blob/master/official/nlp/modeling/layers/gated_feedforward.py) for how to implement a customized feedforward layer.\n", + "\n", + "The following is an example of using [GatedFeedforward](https://github.com/tensorflow/models/blob/master/official/nlp/modeling/layers/gated_feedforward.py):" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "colab": {}, + "colab_type": "code", + "id": "XAbKy_l4y_-i" + }, + "outputs": [], + "source": [ + "# Use GatedFeedforward\n", + "hidden_cfg = dict(default_hidden_cfg)\n", + "hidden_cfg['feedforward_cls'] = modeling.layers.GatedFeedforward\n", + "\n", + "kwargs = dict(default_kwargs)\n", + "kwargs['hidden_cls'] = modeling.layers.TransformerScaffold\n", + "kwargs['hidden_cfg'] = hidden_cfg\n", + "\n", + "encoder_with_gated_feedforward = modeling.networks.EncoderScaffold(**kwargs)\n", + "classifier_model = build_classifier(encoder_with_gated_feedforward)\n", + "# ... 
Train the model ...\n", + "predict(classifier_model)\n", + "\n", + "# Assert that the variable `gate` from GatedFeedforward exists.\n", + "assert 'gate' in ''.join([x.name for x in classifier_model.trainable_weights])" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "colab_type": "text", + "id": "a_8NWUhkzeAq" + }, + "source": [ + "### Build a new Encoder using building blocks from KerasBERT.\n", + "\n", + "Finally, you could also build a new encoder using building blocks in the modeling library.\n", + "\n", + "See [AlbertTransformerEncoder](https://github.com/tensorflow/models/blob/master/official/nlp/modeling/networks/albert_transformer_encoder.py) as an example:\n" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "colab": {}, + "colab_type": "code", + "id": "xsiA3RzUzmUM" + }, + "outputs": [], + "source": [ + "albert_encoder = modeling.networks.AlbertTransformerEncoder(**cfg)\n", + "classifier_model = build_classifier(albert_encoder)\n", + "# ... Train the model ...\n", + "predict(classifier_model)" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "colab_type": "text", + "id": "MeidDfhlHKSO" + }, + "source": [ + "Inspecting the `albert_encoder`, we see it stacks the same `Transformer` layer multiple times." 
+ ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "colab": {}, + "colab_type": "code", + "id": "Uv_juT22HERW" + }, + "outputs": [], + "source": [ + "tf.keras.utils.plot_model(albert_encoder, show_shapes=True, dpi=48)" + ] + } + ], + "metadata": { + "colab": { + "collapsed_sections": [], + "name": "Customizing a Transformer Encoder", + "private_outputs": true, + "provenance": [], + "toc_visible": true + }, + "kernelspec": { + "display_name": "Python 3", + "name": "python3" + } + }, + "nbformat": 4, + "nbformat_minor": 0 +} diff --git a/official/colab/nlp/nlp_modeling_library_intro.ipynb b/official/colab/nlp/nlp_modeling_library_intro.ipynb new file mode 100644 index 0000000000000000000000000000000000000000..f5ffcef96419aef9c25daaf8c585efe9a3043f73 --- /dev/null +++ b/official/colab/nlp/nlp_modeling_library_intro.ipynb @@ -0,0 +1,601 @@ +{ + "cells": [ + { + "cell_type": "markdown", + "metadata": { + "colab_type": "text", + "id": "80xnUmoI7fBX" + }, + "source": [ + "##### Copyright 2020 The TensorFlow Authors." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "cellView": "form", + "colab": {}, + "colab_type": "code", + "id": "8nvTnfs6Q692" + }, + "outputs": [], + "source": [ + "#@title Licensed under the Apache License, Version 2.0 (the \"License\");\n", + "# you may not use this file except in compliance with the License.\n", + "# You may obtain a copy of the License at\n", + "#\n", + "# https://www.apache.org/licenses/LICENSE-2.0\n", + "#\n", + "# Unless required by applicable law or agreed to in writing, software\n", + "# distributed under the License is distributed on an \"AS IS\" BASIS,\n", + "# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.\n", + "# See the License for the specific language governing permissions and\n", + "# limitations under the License." 
+ ] + }, + { + "cell_type": "markdown", + "metadata": { + "colab_type": "text", + "id": "WmfcMK5P5C1G" + }, + "source": [ + "# Introduction to the TensorFlow Models NLP library" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "colab_type": "text", + "id": "cH-oJ8R6AHMK" + }, + "source": [ + "\u003ctable class=\"tfo-notebook-buttons\" align=\"left\"\u003e\n", + " \u003ctd\u003e\n", + " \u003ca target=\"_blank\" href=\"https://www.tensorflow.org/official_models/nlp/nlp_modeling_library_intro\"\u003e\u003cimg src=\"https://www.tensorflow.org/images/tf_logo_32px.png\" /\u003eView on TensorFlow.org\u003c/a\u003e\n", + " \u003c/td\u003e\n", + " \u003ctd\u003e\n", + " \u003ca target=\"_blank\" href=\"https://colab.research.google.com/github/tensorflow/models/blob/master/official/colab/nlp/nlp_modeling_library_intro.ipynb\"\u003e\u003cimg src=\"https://www.tensorflow.org/images/colab_logo_32px.png\" /\u003eRun in Google Colab\u003c/a\u003e\n", + " \u003c/td\u003e\n", + " \u003ctd\u003e\n", + " \u003ca target=\"_blank\" href=\"https://github.com/tensorflow/models/blob/master/official/colab/nlp/nlp_modeling_library_intro.ipynb\"\u003e\u003cimg src=\"https://www.tensorflow.org/images/GitHub-Mark-32px.png\" /\u003eView source on GitHub\u003c/a\u003e\n", + " \u003c/td\u003e\n", + " \u003ctd\u003e\n", + " \u003ca href=\"https://storage.googleapis.com/tensorflow_docs/models/official/colab/nlp/nlp_modeling_library_intro.ipynb\"\u003e\u003cimg src=\"https://www.tensorflow.org/images/download_logo_32px.png\" /\u003eDownload notebook\u003c/a\u003e\n", + " \u003c/td\u003e\n", + "\u003c/table\u003e" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "colab_type": "text", + "id": "0H_EFIhq4-MJ" + }, + "source": [ + "## Learning objectives\n", + "\n", + "In this Colab notebook, you will learn how to build transformer-based models for common NLP tasks including pretraining, span labelling and classification using the building blocks from [NLP modeling 
library](https://github.com/tensorflow/models/tree/master/official/nlp/modeling)." + ] + }, + { + "cell_type": "markdown", + "metadata": { + "colab_type": "text", + "id": "2N97-dps_nUk" + }, + "source": [ + "## Install and import" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "colab_type": "text", + "id": "459ygAVl_rg0" + }, + "source": [ + "### Install the TensorFlow Model Garden pip package\n", + "\n", + "* `tf-models-nightly` is the nightly Model Garden package, created automatically every day.\n", + "* `pip` will install all models and dependencies automatically." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "colab": {}, + "colab_type": "code", + "id": "Y-qGkdh6_sZc" + }, + "outputs": [], + "source": [ + "!pip install -q tf-nightly\n", + "!pip install -q tf-models-nightly" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "colab_type": "text", + "id": "e4huSSwyAG_5" + }, + "source": [ + "### Import TensorFlow and other libraries" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "colab": {}, + "colab_type": "code", + "id": "jqYXqtjBAJd9" + }, + "outputs": [], + "source": [ + "import numpy as np\n", + "import tensorflow as tf\n", + "\n", + "from official.nlp import modeling\n", + "from official.nlp.modeling import layers, losses, models, networks" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "colab_type": "text", + "id": "djBQWjvy-60Y" + }, + "source": [ + "## BERT pretraining model\n", + "\n", + "BERT ([Pre-training of Deep Bidirectional Transformers for Language Understanding](https://arxiv.org/abs/1810.04805)) introduced the method of pre-training language representations on a large text corpus and then using that model for downstream NLP tasks.\n", + "\n", + "In this section, we will learn how to build a model to pretrain BERT on the masked language modeling task and the next sentence prediction task. 
For simplicity, we only show a minimal example and use dummy data." + ] + }, + { + "cell_type": "markdown", + "metadata": { + "colab_type": "text", + "id": "MKuHVlsCHmiq" + }, + "source": [ + "### Build a `BertPretrainer` model wrapping `TransformerEncoder`\n", + "\n", + "The [TransformerEncoder](https://github.com/tensorflow/models/blob/master/official/nlp/modeling/networks/transformer_encoder.py) implements the Transformer-based encoder as described in the [BERT paper](https://arxiv.org/abs/1810.04805). It includes the embedding lookups and transformer layers, but not the masked language model or classification task networks.\n", + "\n", + "The [BertPretrainer](https://github.com/tensorflow/models/blob/master/official/nlp/modeling/models/bert_pretrainer.py) allows a user to pass in a transformer stack, and instantiates the masked language model and classification networks that are used to create the training objectives." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "colab": {}, + "colab_type": "code", + "id": "EXkcXz-9BwB3" + }, + "outputs": [], + "source": [ + "# Build a small transformer network.\n", + "vocab_size = 100\n", + "sequence_length = 16\n", + "network = modeling.networks.TransformerEncoder(\n", + " vocab_size=vocab_size, num_layers=2, sequence_length=sequence_length)" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "colab_type": "text", + "id": "0NH5irV5KTMS" + }, + "source": [ + "Inspecting the encoder, we see that it contains a few embedding layers and stacked `Transformer` layers, which are connected to three input layers:\n", + "\n", + "`input_word_ids`, `input_type_ids` and `input_mask`.\n" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "colab": {}, + "colab_type": "code", + "id": "lZNoZkBrIoff" + }, + "outputs": [], + "source": [ + "tf.keras.utils.plot_model(network, show_shapes=True, dpi=48)" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "colab": {}, + 
"colab_type": "code", + "id": "o7eFOZXiIl-b" + }, + "outputs": [], + "source": [ + "# Create a BERT pretrainer with the created network.\n", + "num_token_predictions = 8\n", + "bert_pretrainer = modeling.models.BertPretrainer(\n", + " network, num_classes=2, num_token_predictions=num_token_predictions, output='predictions')" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "colab_type": "text", + "id": "d5h5HT7gNHx_" + }, + "source": [ + "Inspecting the `bert_pretrainer`, we see it wraps the `encoder` with additional `MaskedLM` and `Classification` heads." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "colab": {}, + "colab_type": "code", + "id": "2tcNfm03IBF7" + }, + "outputs": [], + "source": [ + "tf.keras.utils.plot_model(bert_pretrainer, show_shapes=True, dpi=48)" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "colab": {}, + "colab_type": "code", + "id": "F2oHrXGUIS0M" + }, + "outputs": [], + "source": [ + "# We can feed some dummy data to get masked language model and sentence output.\n", + "batch_size = 2\n", + "word_id_data = np.random.randint(vocab_size, size=(batch_size, sequence_length))\n", + "mask_data = np.random.randint(2, size=(batch_size, sequence_length))\n", + "type_id_data = np.random.randint(2, size=(batch_size, sequence_length))\n", + "masked_lm_positions_data = np.random.randint(2, size=(batch_size, num_token_predictions))\n", + "\n", + "outputs = bert_pretrainer(\n", + " [word_id_data, mask_data, type_id_data, masked_lm_positions_data])\n", + "lm_output = outputs[\"masked_lm\"]\n", + "sentence_output = outputs[\"classification\"]\n", + "print(lm_output)\n", + "print(sentence_output)" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "colab_type": "text", + "id": "bnx3UCHniCS5" + }, + "source": [ + "### Compute loss\n", + "Next, we can use `lm_output` and `sentence_output` to compute `loss`." 
+ ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "colab": {}, + "colab_type": "code", + "id": "k30H4Q86f52x" + }, + "outputs": [], + "source": [ + "masked_lm_ids_data = np.random.randint(vocab_size, size=(batch_size, num_token_predictions))\n", + "masked_lm_weights_data = np.random.randint(2, size=(batch_size, num_token_predictions))\n", + "next_sentence_labels_data = np.random.randint(2, size=(batch_size))\n", + "\n", + "mlm_loss = modeling.losses.weighted_sparse_categorical_crossentropy_loss(\n", + " labels=masked_lm_ids_data,\n", + " predictions=lm_output,\n", + " weights=masked_lm_weights_data)\n", + "sentence_loss = modeling.losses.weighted_sparse_categorical_crossentropy_loss(\n", + " labels=next_sentence_labels_data,\n", + " predictions=sentence_output)\n", + "loss = mlm_loss + sentence_loss\n", + "print(loss)" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "colab_type": "text", + "id": "wrmSs8GjHxVw" + }, + "source": [ + "With the loss, you can optimize the model.\n", + "After training, we can save the weights of the TransformerEncoder for downstream fine-tuning tasks. Please see [run_pretraining.py](https://github.com/tensorflow/models/blob/master/official/nlp/bert/run_pretraining.py) for the full example.\n", + "\n" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "colab_type": "text", + "id": "k8cQVFvBCV4s" + }, + "source": [ + "## Span labeling model\n", + "\n", + "Span labeling is the task of assigning labels to a span of text, for example, labeling a span of text as the answer to a given question.\n", + "\n", + "In this section, we will learn how to build a span labeling model. Again, we use dummy data for simplicity." 
+ ] + }, + { + "cell_type": "markdown", + "metadata": { + "colab_type": "text", + "id": "xrLLEWpfknUW" + }, + "source": [ + "### Build a BertSpanLabeler wrapping TransformerEncoder\n", + "\n", + "[BertSpanLabeler](https://github.com/tensorflow/models/blob/master/official/nlp/modeling/models/bert_span_labeler.py) implements a simple single-span start-end predictor (that is, a model that predicts two values: a start token index and an end token index), suitable for SQuAD-style tasks.\n", + "\n", + "Note that `BertSpanLabeler` wraps a `TransformerEncoder`, the weights of which can be restored from the above pretraining model.\n" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "colab": {}, + "colab_type": "code", + "id": "B941M4iUCejO" + }, + "outputs": [], + "source": [ + "network = modeling.networks.TransformerEncoder(\n", + " vocab_size=vocab_size, num_layers=2, sequence_length=sequence_length)\n", + "\n", + "# Create a BERT trainer with the created network.\n", + "bert_span_labeler = modeling.models.BertSpanLabeler(network)" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "colab_type": "text", + "id": "QpB9pgj4PpMg" + }, + "source": [ + "Inspecting the `bert_span_labeler`, we see it wraps the encoder with an additional `SpanLabeling` head that outputs `start_position` and `end_position`." 
+ ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "colab": {}, + "colab_type": "code", + "id": "RbqRNJCLJu4H" + }, + "outputs": [], + "source": [ + "tf.keras.utils.plot_model(bert_span_labeler, show_shapes=True, dpi=48)" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "colab": {}, + "colab_type": "code", + "id": "fUf1vRxZJwio" + }, + "outputs": [], + "source": [ + "# Create a set of 2-dimensional data tensors to feed into the model.\n", + "word_id_data = np.random.randint(vocab_size, size=(batch_size, sequence_length))\n", + "mask_data = np.random.randint(2, size=(batch_size, sequence_length))\n", + "type_id_data = np.random.randint(2, size=(batch_size, sequence_length))\n", + "\n", + "# Feed the data to the model.\n", + "start_logits, end_logits = bert_span_labeler([word_id_data, mask_data, type_id_data])\n", + "print(start_logits)\n", + "print(end_logits)" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "colab_type": "text", + "id": "WqhgQaN1lt-G" + }, + "source": [ + "### Compute loss\n", + "With `start_logits` and `end_logits`, we can compute loss:" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "colab": {}, + "colab_type": "code", + "id": "waqs6azNl3Nn" + }, + "outputs": [], + "source": [ + "start_positions = np.random.randint(sequence_length, size=(batch_size))\n", + "end_positions = np.random.randint(sequence_length, size=(batch_size))\n", + "\n", + "start_loss = tf.keras.losses.sparse_categorical_crossentropy(\n", + " start_positions, start_logits, from_logits=True)\n", + "end_loss = tf.keras.losses.sparse_categorical_crossentropy(\n", + " end_positions, end_logits, from_logits=True)\n", + "\n", + "total_loss = (tf.reduce_mean(start_loss) + tf.reduce_mean(end_loss)) / 2\n", + "print(total_loss)" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "colab_type": "text", + "id": "Zdf03YtZmd_d" + }, + "source": [ + "With the `loss`, you can optimize 
the model. Please see [run_squad.py](https://github.com/tensorflow/models/blob/master/official/nlp/bert/run_squad.py) for the full example." + ] + }, + { + "cell_type": "markdown", + "metadata": { + "colab_type": "text", + "id": "0A1XnGSTChg9" + }, + "source": [ + "## Classification model\n", + "\n", + "In this last section, we show how to build a text classification model.\n" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "colab_type": "text", + "id": "MSK8OpZgnQa9" + }, + "source": [ + "### Build a BertClassifier model wrapping TransformerEncoder\n", + "\n", + "[BertClassifier](https://github.com/tensorflow/models/blob/master/official/nlp/modeling/models/bert_classifier.py) implements a [CLS] token classification model containing a single classification head." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "colab": {}, + "colab_type": "code", + "id": "cXXCsffkCphk" + }, + "outputs": [], + "source": [ + "network = modeling.networks.TransformerEncoder(\n", + " vocab_size=vocab_size, num_layers=2, sequence_length=sequence_length)\n", + "\n", + "# Create a BERT trainer with the created network.\n", + "num_classes = 2\n", + "bert_classifier = modeling.models.BertClassifier(\n", + " network, num_classes=num_classes)" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "colab_type": "text", + "id": "8tZKueKYP4bB" + }, + "source": [ + "Inspecting the `bert_classifier`, we see it wraps the `encoder` with an additional `Classification` head." 
+ ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "colab": {}, + "colab_type": "code", + "id": "snlutm9ZJgEZ" + }, + "outputs": [], + "source": [ + "tf.keras.utils.plot_model(bert_classifier, show_shapes=True, dpi=48)" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "colab": {}, + "colab_type": "code", + "id": "yyHPHsqBJkCz" + }, + "outputs": [], + "source": [ + "# Create a set of 2-dimensional data tensors to feed into the model.\n", + "word_id_data = np.random.randint(vocab_size, size=(batch_size, sequence_length))\n", + "mask_data = np.random.randint(2, size=(batch_size, sequence_length))\n", + "type_id_data = np.random.randint(2, size=(batch_size, sequence_length))\n", + "\n", + "# Feed the data to the model.\n", + "logits = bert_classifier([word_id_data, mask_data, type_id_data])\n", + "print(logits)" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "colab_type": "text", + "id": "w--a2mg4nzKm" + }, + "source": [ + "### Compute loss\n", + "\n", + "With `logits`, we can compute `loss`:" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "colab": {}, + "colab_type": "code", + "id": "9X0S1DoFn_5Q" + }, + "outputs": [], + "source": [ + "labels = np.random.randint(num_classes, size=(batch_size))\n", + "\n", + "loss = modeling.losses.weighted_sparse_categorical_crossentropy_loss(\n", + " labels=labels, predictions=tf.nn.log_softmax(logits, axis=-1))\n", + "print(loss)" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "colab_type": "text", + "id": "mzBqOylZo3og" + }, + "source": [ + "With the `loss`, you can optimize the model. Please see [run_classifier.py](https://github.com/tensorflow/models/blob/master/official/nlp/bert/run_classifier.py) or the colab [fine_tuning_bert.ipynb](https://github.com/tensorflow/models/blob/master/official/colab/fine_tuning_bert.ipynb) for the full example." 
+ ] + } + ], + "metadata": { + "colab": { + "collapsed_sections": [], + "name": "Introduction to the TensorFlow Models NLP library", + "private_outputs": true, + "provenance": [], + "toc_visible": true + }, + "kernelspec": { + "display_name": "Python 3", + "name": "python3" + } + }, + "nbformat": 4, + "nbformat_minor": 0 +} diff --git a/official/core/base_task.py b/official/core/base_task.py index f5dfdd4f5c2ff9b75b3571df31016196e92cd934..76ebd8e14dea783bd5e495bafeb2e3218ae26eb6 100644 --- a/official/core/base_task.py +++ b/official/core/base_task.py @@ -18,11 +18,11 @@ import abc import functools from typing import Any, Callable, Optional +from absl import logging import six import tensorflow as tf from official.modeling.hyperparams import config_definitions as cfg -from official.utils import registry @six.add_metaclass(abc.ABCMeta) @@ -37,17 +37,29 @@ class Task(tf.Module): # Special keys in train/validate step returned logs. loss = "loss" - def __init__(self, params: cfg.TaskConfig): + def __init__(self, params: cfg.TaskConfig, logging_dir: str = None): + """Task initialization. + + Args: + params: cfg.TaskConfig instance. + logging_dir: a string pointing to where the model, summaries etc. will be + saved. You can also write additional stuff in this directory. + """ self._task_config = params + self._logging_dir = logging_dir @property def task_config(self) -> cfg.TaskConfig: return self._task_config + @property + def logging_dir(self) -> str: + return self._logging_dir + def initialize(self, model: tf.keras.Model): """A callback function used as CheckpointManager's init_fn. - This function will be called when no checkpoint found for the model. + This function will be called when no checkpoint is found for the model. If there is a checkpoint, the checkpoint will be loaded and this function will not be called. You can use this callback function to load a pretrained checkpoint, saved under a directory other than the model_dir. 
@@ -55,11 +67,23 @@ class Task(tf.Module): Args: model: The keras.Model built or used by this task. """ - pass + ckpt_dir_or_file = self.task_config.init_checkpoint + logging.info("Trying to load pretrained checkpoint from %s", + ckpt_dir_or_file) + if tf.io.gfile.isdir(ckpt_dir_or_file): + ckpt_dir_or_file = tf.train.latest_checkpoint(ckpt_dir_or_file) + if not ckpt_dir_or_file: + return + + ckpt = tf.train.Checkpoint(**model.checkpoint_items) + status = ckpt.restore(ckpt_dir_or_file) + status.expect_partial().assert_existing_objects_matched() + logging.info("Finished loading pretrained checkpoint from %s", + ckpt_dir_or_file) @abc.abstractmethod def build_model(self) -> tf.keras.Model: - """Creates the model architecture. + """Creates model architecture. Returns: A model instance. @@ -107,6 +131,7 @@ class Task(tf.Module): """Returns a dataset or a nested structure of dataset functions. Dataset functions define per-host datasets with the per-replica batch size. + With distributed training, this method runs on remote hosts. Args: params: hyperparams to create input pipelines. @@ -122,7 +147,7 @@ class Task(tf.Module): Args: labels: optional label tensors. model_outputs: a nested structure of output tensors. - aux_losses: auxiliarly loss tensors, i.e. `losses` in keras.Model. + aux_losses: auxiliary loss tensors, i.e. `losses` in keras.Model. Returns: The total loss tensor. @@ -172,6 +197,8 @@ class Task(tf.Module): metrics=None): """Does forward and backward. + With distribution strategies, this method runs on devices. + Args: inputs: a dictionary of input tensors. model: the model, forward pass definition. @@ -217,7 +244,9 @@ class Task(tf.Module): return logs def validation_step(self, inputs, model: tf.keras.Model, metrics=None): - """Validatation step. + """Validation step. + + With distribution strategies, this method runs on devices. Args: inputs: a dictionary of input tensors. 
@@ -244,52 +273,24 @@ class Task(tf.Module): return logs def inference_step(self, inputs, model: tf.keras.Model): - """Performs the forward step.""" - return model(inputs, training=False) - - -_REGISTERED_TASK_CLS = {} - + """Performs the forward step. -# TODO(b/158268740): Move these outside the base class file. -# TODO(b/158741360): Add type annotations once pytype checks across modules. -def register_task_cls(task_config_cls): - """Decorates a factory of Tasks for lookup by a subclass of TaskConfig. + With distribution strategies, this method runs on devices. - This decorator supports registration of tasks as follows: + Args: + inputs: a dictionary of input tensors. + model: the keras.Model. - ``` - @dataclasses.dataclass - class MyTaskConfig(TaskConfig): - # Add fields here. - pass + Returns: + Model outputs. + """ + return model(inputs, training=False) - @register_task_cls(MyTaskConfig) - class MyTask(Task): - # Inherits def __init__(self, task_config). + def aggregate_logs(self, state, step_logs): + """Optional aggregation over logs returned from a validation step.""" pass - my_task_config = MyTaskConfig() - my_task = get_task(my_task_config) # Returns MyTask(my_task_config). - ``` - - Besisdes a class itself, other callables that create a Task from a TaskConfig - can be decorated by the result of this function, as long as there is at most - one registration for each config class. - - Args: - task_config_cls: a subclass of TaskConfig (*not* an instance of TaskConfig). - Each task_config_cls can only be used for a single registration. - - Returns: - A callable for use as class decorator that registers the decorated class - for creation from an instance of task_config_cls. - """ - return registry.register(_REGISTERED_TASK_CLS, task_config_cls) - + def reduce_aggregated_logs(self, aggregated_logs): + """Optional reduce of aggregated logs over validation steps.""" + return {} -# The user-visible get_task() is defined after classes have been registered. 
-# TODO(b/158741360): Add type annotations once pytype checks across modules. -def get_task_cls(task_config_cls): - task_cls = registry.lookup(_REGISTERED_TASK_CLS, task_config_cls) - return task_cls diff --git a/official/core/exp_factory.py b/official/core/exp_factory.py new file mode 100644 index 0000000000000000000000000000000000000000..8270565b7d97bfd820de26bbbda6d3f1d96e33d2 --- /dev/null +++ b/official/core/exp_factory.py @@ -0,0 +1,37 @@ +# Lint as: python3 +# Copyright 2020 The TensorFlow Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. 
+# ============================================================================== +"""Experiment factory methods.""" + +from official.modeling.hyperparams import config_definitions as cfg +from official.utils import registry + + +_REGISTERED_CONFIGS = {} + + +def register_config_factory(name): + """Register ExperimentConfig factory method.""" + return registry.register(_REGISTERED_CONFIGS, name) + + +def get_exp_config_creater(exp_name: str): + """Looks up ExperimentConfig factory methods.""" + exp_creater = registry.lookup(_REGISTERED_CONFIGS, exp_name) + return exp_creater + + +def get_exp_config(exp_name: str) -> cfg.ExperimentConfig: + return get_exp_config_creater(exp_name)() diff --git a/official/core/input_reader.py b/official/core/input_reader.py index 52f6e84e4bd02d4178586556ca191912de18fc18..20589ad9cee33546922cd5c9deaba67b2a0509ad 100644 --- a/official/core/input_reader.py +++ b/official/core/input_reader.py @@ -32,8 +32,9 @@ class InputReader: dataset_fn=tf.data.TFRecordDataset, decoder_fn: Optional[Callable[..., Any]] = None, parser_fn: Optional[Callable[..., Any]] = None, - dataset_transform_fn: Optional[Callable[[tf.data.Dataset], - tf.data.Dataset]] = None, + transform_and_batch_fn: Optional[Callable[ + [tf.data.Dataset, Optional[tf.distribute.InputContext]], + tf.data.Dataset]] = None, postprocess_fn: Optional[Callable[..., Any]] = None): """Initializes an InputReader instance. @@ -48,9 +49,12 @@ class InputReader: parser_fn: An optional `callable` that takes the decoded raw tensors dict and parse them into a dictionary of tensors that can be consumed by the model. It will be executed after decoder_fn. - dataset_transform_fn: An optional `callable` that takes a - `tf.data.Dataset` object and returns a `tf.data.Dataset`. It will be - executed after parser_fn. + transform_and_batch_fn: An optional `callable` that takes a + `tf.data.Dataset` object and an optional `tf.distribute.InputContext` as + input, and returns a `tf.data.Dataset` object. 
It will be + executed after `parser_fn` to transform and batch the dataset; if None, + after `parser_fn` is executed, the dataset will be batched into + per-replica batch size. postprocess_fn: A optional `callable` that processes batched tensors. It will be executed after batching. """ @@ -101,7 +105,7 @@ class InputReader: self._dataset_fn = dataset_fn self._decoder_fn = decoder_fn self._parser_fn = parser_fn - self._dataset_transform_fn = dataset_transform_fn + self._transform_and_batch_fn = transform_and_batch_fn self._postprocess_fn = postprocess_fn def _read_sharded_files( @@ -171,6 +175,9 @@ class InputReader: as_supervised=self._tfds_as_supervised, decoders=decoders, read_config=read_config) + + if self._is_training: + dataset = dataset.repeat() return dataset @property @@ -211,13 +218,13 @@ class InputReader: dataset = maybe_map_fn(dataset, self._decoder_fn) dataset = maybe_map_fn(dataset, self._parser_fn) - if self._dataset_transform_fn is not None: - dataset = self._dataset_transform_fn(dataset) - - per_replica_batch_size = input_context.get_per_replica_batch_size( - self._global_batch_size) if input_context else self._global_batch_size + if self._transform_and_batch_fn is not None: + dataset = self._transform_and_batch_fn(dataset, input_context) + else: + per_replica_batch_size = input_context.get_per_replica_batch_size( + self._global_batch_size) if input_context else self._global_batch_size + dataset = dataset.batch( + per_replica_batch_size, drop_remainder=self._drop_remainder) - dataset = dataset.batch( - per_replica_batch_size, drop_remainder=self._drop_remainder) dataset = maybe_map_fn(dataset, self._postprocess_fn) return dataset.prefetch(tf.data.experimental.AUTOTUNE) diff --git a/official/core/task_factory.py b/official/core/task_factory.py new file mode 100644 index 0000000000000000000000000000000000000000..394031ae99405bf9b69d6236d41c423fcb886697 --- /dev/null +++ b/official/core/task_factory.py @@ -0,0 +1,68 @@ +# Lint as: python3 +# 
Copyright 2020 The TensorFlow Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. +# ============================================================================== +"""A global factory to register and access all registered tasks.""" + +from official.utils import registry + +_REGISTERED_TASK_CLS = {} + + +# TODO(b/158741360): Add type annotations once pytype checks across modules. +def register_task_cls(task_config_cls): + """Decorates a factory of Tasks for lookup by a subclass of TaskConfig. + + This decorator supports registration of tasks as follows: + + ``` + @dataclasses.dataclass + class MyTaskConfig(TaskConfig): + # Add fields here. + pass + + @register_task_cls(MyTaskConfig) + class MyTask(Task): + # Inherits def __init__(self, task_config). + pass + + my_task_config = MyTaskConfig() + my_task = get_task(my_task_config) # Returns MyTask(my_task_config). + ``` + + Besides a class itself, other callables that create a Task from a TaskConfig + can be decorated by the result of this function, as long as there is at most + one registration for each config class. + + Args: + task_config_cls: a subclass of TaskConfig (*not* an instance of TaskConfig). + Each task_config_cls can only be used for a single registration. + + Returns: + A callable for use as class decorator that registers the decorated class + for creation from an instance of task_config_cls. 
+ """ + return registry.register(_REGISTERED_TASK_CLS, task_config_cls) + + +def get_task(task_config, **kwargs): + """Creates a Task (of suitable subclass type) from task_config.""" + return get_task_cls(task_config.__class__)(task_config, **kwargs) + + +# The user-visible get_task() is defined after classes have been registered. +# TODO(b/158741360): Add type annotations once pytype checks across modules. +def get_task_cls(task_config_cls): + task_cls = registry.lookup(_REGISTERED_TASK_CLS, task_config_cls) + return task_cls diff --git a/official/modeling/activations/gelu.py b/official/modeling/activations/gelu.py index c045bffa95b29e069831b548701b76d1b8e76c0d..dc4de8204ae81e9ad8c17f12ed0973fb0eff3c86 100644 --- a/official/modeling/activations/gelu.py +++ b/official/modeling/activations/gelu.py @@ -14,12 +14,6 @@ # ============================================================================== """Gaussian error linear unit.""" -from __future__ import absolute_import -from __future__ import division -from __future__ import print_function - -import math - import tensorflow as tf @@ -35,6 +29,4 @@ def gelu(x): Returns: `x` with the GELU activation applied. """ - cdf = 0.5 * (1.0 + tf.tanh( - (math.sqrt(2 / math.pi) * (x + 0.044715 * tf.pow(x, 3))))) - return x * cdf + return tf.keras.activations.gelu(x, approximate=True) diff --git a/official/modeling/hyperparams/base_config.py b/official/modeling/hyperparams/base_config.py index 7ce5ce2d55016dce0c985a0e6f9fe3893a25f644..b03f069c8bdae2317bd57ac9b2cc4c91ce9d169b 100644 --- a/official/modeling/hyperparams/base_config.py +++ b/official/modeling/hyperparams/base_config.py @@ -126,10 +126,10 @@ class Config(params_dict.ParamsDict): subconfig_type = Config if k in cls.__annotations__: # Directly Config subtype. 
- type_annotation = cls.__annotations__[k] + type_annotation = cls.__annotations__[k] # pytype: disable=invalid-annotation if (isinstance(type_annotation, type) and issubclass(type_annotation, Config)): - subconfig_type = cls.__annotations__[k] + subconfig_type = cls.__annotations__[k] # pytype: disable=invalid-annotation else: # Check if the field is a sequence of subtypes. field_type = getattr(type_annotation, '__origin__', type(None)) diff --git a/official/modeling/hyperparams/config_definitions.py b/official/modeling/hyperparams/config_definitions.py index 2fbcdea4455aa0f11728a3b077c4d981df8682cd..c58b1de7fa5c728d549396ef8aaead0376e96963 100644 --- a/official/modeling/hyperparams/config_definitions.py +++ b/official/modeling/hyperparams/config_definitions.py @@ -14,13 +14,13 @@ # limitations under the License. # ============================================================================== """Common configuration settings.""" + from typing import Optional, Union import dataclasses from official.modeling.hyperparams import base_config from official.modeling.optimization.configs import optimization_config -from official.utils import registry OptimizationConfig = optimization_config.OptimizationConfig @@ -111,6 +111,8 @@ class RuntimeConfig(base_config.Config): run_eagerly: Whether or not to run the experiment eagerly. batchnorm_spatial_persistent: Whether or not to enable the spatial persistent mode for CuDNN batch norm kernel for improved GPU performance. + allow_tpu_summary: Whether to allow summaries to happen inside the XLA program + that runs on TPU through automatic outside compilation. 
""" distribution_strategy: str = "mirrored" enable_xla: bool = False @@ -123,8 +125,8 @@ class RuntimeConfig(base_config.Config): task_index: int = -1 all_reduce_alg: Optional[str] = None num_packs: int = 1 - loss_scale: Optional[Union[str, float]] = None mixed_precision_dtype: Optional[str] = None + loss_scale: Optional[Union[str, float]] = None run_eagerly: bool = False batchnorm_spatial_persistent: bool = False @@ -172,25 +174,39 @@ class TrainerConfig(base_config.Config): eval_tf_function: whether or not to use tf_function for eval. steps_per_loop: number of steps per loop. summary_interval: number of steps between each summary. - checkpoint_intervals: number of steps between checkpoints. + checkpoint_interval: number of steps between checkpoints. max_to_keep: max checkpoints to keep. continuous_eval_timeout: maximum number of seconds to wait between - checkpoints, if set to None, continuous eval will wait indefinetely. + checkpoints, if set to None, continuous eval will wait indefinitely. + This is only used continuous_train_and_eval and continuous_eval modes. + train_steps: number of train steps. + validation_steps: number of eval steps. If `None`, the entire eval dataset + is used. + validation_interval: number of training steps to run between evaluations. """ optimizer_config: OptimizationConfig = OptimizationConfig() + # Orbit settings. train_tf_while_loop: bool = True train_tf_function: bool = True eval_tf_function: bool = True + allow_tpu_summary: bool = False + # Trainer intervals. steps_per_loop: int = 1000 summary_interval: int = 1000 checkpoint_interval: int = 1000 + # Checkpoint manager. max_to_keep: int = 5 continuous_eval_timeout: Optional[int] = None + # Train/Eval routines. 
+ train_steps: int = 0 + validation_steps: Optional[int] = None + validation_interval: int = 1000 @dataclasses.dataclass class TaskConfig(base_config.Config): - network: base_config.Config = None + init_checkpoint: str = "" + model: base_config.Config = None train_data: DataConfig = DataConfig() validation_data: DataConfig = DataConfig() @@ -198,24 +214,7 @@ class TaskConfig(base_config.Config): @dataclasses.dataclass class ExperimentConfig(base_config.Config): """Top-level configuration.""" - mode: str = "train" # train, eval, train_and_eval. task: TaskConfig = TaskConfig() trainer: TrainerConfig = TrainerConfig() runtime: RuntimeConfig = RuntimeConfig() - train_steps: int = 0 - validation_steps: Optional[int] = None - validation_interval: int = 100 - - -_REGISTERED_CONFIGS = {} - - -def register_config_factory(name): - """Register ExperimentConfig factory method.""" - return registry.register(_REGISTERED_CONFIGS, name) - -def get_exp_config_creater(exp_name: str): - """Looks up ExperimentConfig factory methods.""" - exp_creater = registry.lookup(_REGISTERED_CONFIGS, exp_name) - return exp_creater diff --git a/official/modeling/optimization/configs/learning_rate_config.py b/official/modeling/optimization/configs/learning_rate_config.py index b55c713f1905cf9aaa52f87a6663d3385628d5a5..2a0625e0a75040e115e91c6be5b89bddb0de06b0 100644 --- a/official/modeling/optimization/configs/learning_rate_config.py +++ b/official/modeling/optimization/configs/learning_rate_config.py @@ -20,6 +20,20 @@ import dataclasses from official.modeling.hyperparams import base_config +@dataclasses.dataclass +class ConstantLrConfig(base_config.Config): + """Configuration for constant learning rate. + + This class is a container for the constant learning rate decay configs. + + Attributes: + name: The name of the learning rate schedule. Defaults to Constant. + learning_rate: A float. The learning rate. Defaults to 0.1. 
+ """ + name: str = 'Constant' + learning_rate: float = 0.1 + + @dataclasses.dataclass class StepwiseLrConfig(base_config.Config): """Configuration for stepwise learning rate decay. diff --git a/official/modeling/optimization/configs/optimization_config.py b/official/modeling/optimization/configs/optimization_config.py index 8aba9943ae3bf3f4a9d0c1df4d715d63ef0a26a8..23e112e1b6197a8505a18b9b8d573012d1dd5e73 100644 --- a/official/modeling/optimization/configs/optimization_config.py +++ b/official/modeling/optimization/configs/optimization_config.py @@ -39,12 +39,14 @@ class OptimizerConfig(oneof.OneOfConfig): adam: adam optimizer config. adamw: adam with weight decay. lamb: lamb optimizer. + rmsprop: rmsprop optimizer. """ type: Optional[str] = None sgd: opt_cfg.SGDConfig = opt_cfg.SGDConfig() adam: opt_cfg.AdamConfig = opt_cfg.AdamConfig() adamw: opt_cfg.AdamWeightDecayConfig = opt_cfg.AdamWeightDecayConfig() lamb: opt_cfg.LAMBConfig = opt_cfg.LAMBConfig() + rmsprop: opt_cfg.RMSPropConfig = opt_cfg.RMSPropConfig() @dataclasses.dataclass @@ -53,12 +55,14 @@ class LrConfig(oneof.OneOfConfig): Attributes: type: 'str', type of lr schedule to be used, on the of fields below. + constant: constant learning rate config. stepwise: stepwise learning rate config. exponential: exponential learning rate config. polynomial: polynomial learning rate config. cosine: cosine learning rate config. 
""" type: Optional[str] = None + constant: lr_cfg.ConstantLrConfig = lr_cfg.ConstantLrConfig() stepwise: lr_cfg.StepwiseLrConfig = lr_cfg.StepwiseLrConfig() exponential: lr_cfg.ExponentialLrConfig = lr_cfg.ExponentialLrConfig() polynomial: lr_cfg.PolynomialLrConfig = lr_cfg.PolynomialLrConfig() diff --git a/official/modeling/optimization/configs/optimizer_config.py b/official/modeling/optimization/configs/optimizer_config.py index 4cafa9659119386d2583d8b52cb2ddf9afe37131..5e7ca2d0c195883b0af7a5920bc13402bada4139 100644 --- a/official/modeling/optimization/configs/optimizer_config.py +++ b/official/modeling/optimization/configs/optimizer_config.py @@ -28,18 +28,37 @@ class SGDConfig(base_config.Config): Attributes: name: name of the optimizer. - learning_rate: learning_rate for SGD optimizer. decay: decay rate for SGD optimizer. nesterov: nesterov for SGD optimizer. momentum: momentum for SGD optimizer. """ name: str = "SGD" - learning_rate: float = 0.01 decay: float = 0.0 nesterov: bool = False momentum: float = 0.0 +@dataclasses.dataclass +class RMSPropConfig(base_config.Config): + """Configuration for RMSProp optimizer. + + The attributes for this class matches the arguments of + tf.keras.optimizers.RMSprop. + + Attributes: + name: name of the optimizer. + rho: discounting factor for RMSprop optimizer. + momentum: momentum for RMSprop optimizer. + epsilon: epsilon value for RMSprop optimizer, help with numerical stability. + centered: Whether to normalize gradients or not. + """ + name: str = "RMSprop" + rho: float = 0.9 + momentum: float = 0.0 + epsilon: float = 1e-7 + centered: bool = False + + @dataclasses.dataclass class AdamConfig(base_config.Config): """Configuration for Adam optimizer. @@ -49,7 +68,6 @@ class AdamConfig(base_config.Config): Attributes: name: name of the optimizer. - learning_rate: learning_rate for Adam optimizer. beta_1: decay rate for 1st order moments. beta_2: decay rate for 2st order moments. 
epsilon: epsilon value used for numerical stability in Adam optimizer. @@ -57,7 +75,6 @@ class AdamConfig(base_config.Config): the paper "On the Convergence of Adam and beyond". """ name: str = "Adam" - learning_rate: float = 0.001 beta_1: float = 0.9 beta_2: float = 0.999 epsilon: float = 1e-07 @@ -70,7 +87,6 @@ class AdamWeightDecayConfig(base_config.Config): Attributes: name: name of the optimizer. - learning_rate: learning_rate for the optimizer. beta_1: decay rate for 1st order moments. beta_2: decay rate for 2st order moments. epsilon: epsilon value used for numerical stability in the optimizer. @@ -83,7 +99,6 @@ class AdamWeightDecayConfig(base_config.Config): include in weight decay. """ name: str = "AdamWeightDecay" - learning_rate: float = 0.001 beta_1: float = 0.9 beta_2: float = 0.999 epsilon: float = 1e-07 @@ -102,7 +117,6 @@ class LAMBConfig(base_config.Config): Attributes: name: name of the optimizer. - learning_rate: learning_rate for Adam optimizer. beta_1: decay rate for 1st order moments. beta_2: decay rate for 2st order moments. epsilon: epsilon value used for numerical stability in LAMB optimizer. @@ -116,7 +130,6 @@ class LAMBConfig(base_config.Config): be excluded. """ name: str = "LAMB" - learning_rate: float = 0.001 beta_1: float = 0.9 beta_2: float = 0.999 epsilon: float = 1e-6 diff --git a/official/modeling/optimization/optimizer_factory.py b/official/modeling/optimization/optimizer_factory.py index 0988f6b3dd7ecc7b99e6f12e617aacba409d1fa3..c9ac04c42213c1a5904f162f369148ec43b0af82 100644 --- a/official/modeling/optimization/optimizer_factory.py +++ b/official/modeling/optimization/optimizer_factory.py @@ -14,7 +14,6 @@ # limitations under the License. 
# ============================================================================== """Optimizer factory class.""" - from typing import Union import tensorflow as tf @@ -29,7 +28,8 @@ OPTIMIZERS_CLS = { 'sgd': tf.keras.optimizers.SGD, 'adam': tf.keras.optimizers.Adam, 'adamw': nlp_optimization.AdamWeightDecay, - 'lamb': tfa_optimizers.LAMB + 'lamb': tfa_optimizers.LAMB, + 'rmsprop': tf.keras.optimizers.RMSprop } LR_CLS = { @@ -60,7 +60,7 @@ class OptimizerFactory(object): params = { 'optimizer': { 'type': 'sgd', - 'sgd': {'learning_rate': 0.1, 'momentum': 0.9} + 'sgd': {'momentum': 0.9} }, 'learning_rate': { 'type': 'stepwise', @@ -88,12 +88,15 @@ class OptimizerFactory(object): self._optimizer_config = config.optimizer.get() self._optimizer_type = config.optimizer.type - if self._optimizer_config is None: + if self._optimizer_type is None: raise ValueError('Optimizer type must be specified') self._lr_config = config.learning_rate.get() self._lr_type = config.learning_rate.type + if self._lr_type is None: + raise ValueError('Learning rate type must be specified') + self._warmup_config = config.warmup.get() self._warmup_type = config.warmup.type @@ -101,18 +104,15 @@ class OptimizerFactory(object): """Build learning rate. Builds learning rate from config. Learning rate schedule is built according - to the learning rate config. If there is no learning rate config, optimizer - learning rate is returned. + to the learning rate config. If learning rate type is constant, + lr_config.learning_rate is returned. Returns: - tf.keras.optimizers.schedules.LearningRateSchedule instance. If no - learning rate schedule defined, optimizer_config.learning_rate is - returned. + tf.keras.optimizers.schedules.LearningRateSchedule instance. If + learning rate type is constant, lr_config.learning_rate is returned. """ - - # TODO(arashwan): Explore if we want to only allow explicit const lr sched. 
- if not self._lr_config: - lr = self._optimizer_config.learning_rate + if self._lr_type == 'constant': + lr = self._lr_config.learning_rate else: lr = LR_CLS[self._lr_type](**self._lr_config.as_dict()) diff --git a/official/modeling/optimization/optimizer_factory_test.py b/official/modeling/optimization/optimizer_factory_test.py index d7ffa16cfaf3abcd3264f7144afd9e31c81bb272..b3218778528eea895fc83c4da59ad5bcccbfa655 100644 --- a/official/modeling/optimization/optimizer_factory_test.py +++ b/official/modeling/optimization/optimizer_factory_test.py @@ -15,91 +15,72 @@ # ============================================================================== """Tests for optimizer_factory.py.""" +from absl.testing import parameterized + import tensorflow as tf -import tensorflow_addons.optimizers as tfa_optimizers from official.modeling.optimization import optimizer_factory from official.modeling.optimization.configs import optimization_config -from official.nlp import optimization as nlp_optimization - - -class OptimizerFactoryTest(tf.test.TestCase): - - def test_sgd_optimizer(self): - params = { - 'optimizer': { - 'type': 'sgd', - 'sgd': {'learning_rate': 0.1, 'momentum': 0.9} - } - } - expected_optimizer_config = { - 'name': 'SGD', - 'learning_rate': 0.1, - 'decay': 0.0, - 'momentum': 0.9, - 'nesterov': False - } - opt_config = optimization_config.OptimizationConfig(params) - opt_factory = optimizer_factory.OptimizerFactory(opt_config) - lr = opt_factory.build_learning_rate() - optimizer = opt_factory.build_optimizer(lr) - self.assertIsInstance(optimizer, tf.keras.optimizers.SGD) - self.assertEqual(expected_optimizer_config, optimizer.get_config()) - def test_adam_optimizer(self): +class OptimizerFactoryTest(tf.test.TestCase, parameterized.TestCase): - # Define adam optimizer with default values. 
+ @parameterized.parameters( + ('sgd'), + ('rmsprop'), + ('adam'), + ('adamw'), + ('lamb')) + def test_optimizers(self, optimizer_type): params = { 'optimizer': { - 'type': 'adam' + 'type': optimizer_type + }, + 'learning_rate': { + 'type': 'constant', + 'constant': { + 'learning_rate': 0.1 + } } } - expected_optimizer_config = tf.keras.optimizers.Adam().get_config() + optimizer_cls = optimizer_factory.OPTIMIZERS_CLS[optimizer_type] + expected_optimizer_config = optimizer_cls().get_config() + expected_optimizer_config['learning_rate'] = 0.1 opt_config = optimization_config.OptimizationConfig(params) opt_factory = optimizer_factory.OptimizerFactory(opt_config) lr = opt_factory.build_learning_rate() optimizer = opt_factory.build_optimizer(lr) - self.assertIsInstance(optimizer, tf.keras.optimizers.Adam) + self.assertIsInstance(optimizer, optimizer_cls) self.assertEqual(expected_optimizer_config, optimizer.get_config()) - def test_adam_weight_decay_optimizer(self): + def test_missing_types(self): params = { 'optimizer': { - 'type': 'adamw' + 'type': 'sgd', + 'sgd': {'momentum': 0.9} } } - expected_optimizer_config = nlp_optimization.AdamWeightDecay().get_config() - opt_config = optimization_config.OptimizationConfig(params) - opt_factory = optimizer_factory.OptimizerFactory(opt_config) - lr = opt_factory.build_learning_rate() - optimizer = opt_factory.build_optimizer(lr) - - self.assertIsInstance(optimizer, nlp_optimization.AdamWeightDecay) - self.assertEqual(expected_optimizer_config, optimizer.get_config()) - - def test_lamb_optimizer(self): + with self.assertRaises(ValueError): + optimizer_factory.OptimizerFactory( + optimization_config.OptimizationConfig(params)) params = { - 'optimizer': { - 'type': 'lamb' + 'learning_rate': { + 'type': 'stepwise', + 'stepwise': {'boundaries': [10000, 20000], + 'values': [0.1, 0.01, 0.001]} } } - expected_optimizer_config = tfa_optimizers.LAMB().get_config() - opt_config = optimization_config.OptimizationConfig(params) - 
opt_factory = optimizer_factory.OptimizerFactory(opt_config) - lr = opt_factory.build_learning_rate() - optimizer = opt_factory.build_optimizer(lr) - - self.assertIsInstance(optimizer, tfa_optimizers.LAMB) - self.assertEqual(expected_optimizer_config, optimizer.get_config()) + with self.assertRaises(ValueError): + optimizer_factory.OptimizerFactory( + optimization_config.OptimizationConfig(params)) def test_stepwise_lr_schedule(self): params = { 'optimizer': { 'type': 'sgd', - 'sgd': {'learning_rate': 0.1, 'momentum': 0.9} + 'sgd': {'momentum': 0.9} }, 'learning_rate': { 'type': 'stepwise', @@ -126,7 +107,7 @@ class OptimizerFactoryTest(tf.test.TestCase): params = { 'optimizer': { 'type': 'sgd', - 'sgd': {'learning_rate': 0.1, 'momentum': 0.9} + 'sgd': {'momentum': 0.9} }, 'learning_rate': { 'type': 'stepwise', @@ -159,7 +140,7 @@ class OptimizerFactoryTest(tf.test.TestCase): params = { 'optimizer': { 'type': 'sgd', - 'sgd': {'learning_rate': 0.1, 'momentum': 0.9} + 'sgd': {'momentum': 0.9} }, 'learning_rate': { 'type': 'exponential', @@ -189,7 +170,7 @@ class OptimizerFactoryTest(tf.test.TestCase): params = { 'optimizer': { 'type': 'sgd', - 'sgd': {'learning_rate': 0.1, 'momentum': 0.9} + 'sgd': {'momentum': 0.9} }, 'learning_rate': { 'type': 'polynomial', @@ -213,7 +194,7 @@ class OptimizerFactoryTest(tf.test.TestCase): params = { 'optimizer': { 'type': 'sgd', - 'sgd': {'learning_rate': 0.1, 'momentum': 0.9} + 'sgd': {'momentum': 0.9} }, 'learning_rate': { 'type': 'cosine', @@ -239,7 +220,13 @@ class OptimizerFactoryTest(tf.test.TestCase): params = { 'optimizer': { 'type': 'sgd', - 'sgd': {'learning_rate': 0.1, 'momentum': 0.9} + 'sgd': {'momentum': 0.9} + }, + 'learning_rate': { + 'type': 'constant', + 'constant': { + 'learning_rate': 0.1 + } }, 'warmup': { 'type': 'linear', @@ -263,7 +250,7 @@ class OptimizerFactoryTest(tf.test.TestCase): params = { 'optimizer': { 'type': 'sgd', - 'sgd': {'learning_rate': 0.1, 'momentum': 0.9} + 'sgd': {'momentum': 0.9} }, 
'learning_rate': { 'type': 'stepwise', diff --git a/official/modeling/tf_utils.py b/official/modeling/tf_utils.py index 34f8f66e75733493d6e061b8f0b9571c1e038f6c..14b6a3f1f8f64635ee90facc1874e359a2d05229 100644 --- a/official/modeling/tf_utils.py +++ b/official/modeling/tf_utils.py @@ -88,7 +88,6 @@ def is_special_none_tensor(tensor): return tensor.shape.ndims == 0 and tensor.dtype == tf.int32 -# TODO(hongkuny): consider moving custom string-map lookup to keras api. def get_activation(identifier): """Maps a identifier to a Python function, e.g., "relu" => `tf.nn.relu`. @@ -173,3 +172,18 @@ def assert_rank(tensor, expected_rank, name=None): "For the tensor `%s`, the actual tensor rank `%d` (shape = %s) is not " "equal to the expected tensor rank `%s`" % (name, actual_rank, str(tensor.shape), str(expected_rank))) + + +def safe_mean(losses): + """Computes a safe mean of the losses. + + Args: + losses: `Tensor` whose elements contain individual loss measurements. + + Returns: + A scalar representing the mean of `losses`. If `num_present` is zero, + then zero is returned. + """ + total = tf.reduce_sum(losses) + num_elements = tf.cast(tf.size(losses), dtype=losses.dtype) + return tf.math.divide_no_nan(total, num_elements) diff --git a/official/modeling/training/distributed_executor.py b/official/modeling/training/distributed_executor.py index 11451260cdca52a9c9f4019010123c4d2b40e99e..4aeaa2b41d21704dadbe71510912d5ccab6b8be0 100644 --- a/official/modeling/training/distributed_executor.py +++ b/official/modeling/training/distributed_executor.py @@ -63,8 +63,8 @@ def metrics_as_dict(metric): """Puts input metric(s) into a list. Args: - metric: metric(s) to be put into the list. `metric` could be a object, a - list or a dict of tf.keras.metrics.Metric or has the `required_method`. + metric: metric(s) to be put into the list. `metric` could be an object, a + list, or a dict of tf.keras.metrics.Metric or has the `required_method`. Returns: A dictionary of valid metrics. 
@@ -351,7 +351,8 @@ class DistributedExecutor(object): train_input_fn: (params: dict) -> tf.data.Dataset training data input function. eval_input_fn: (Optional) same type as train_input_fn. If not None, will - trigger evaluting metric on eval data. If None, will not run eval step. + trigger evaluating metric on eval data. If None, will not run the eval + step. model_dir: the folder path for model checkpoints. total_steps: total training steps. iterations_per_loop: train steps per loop. After each loop, this job will @@ -672,7 +673,7 @@ class DistributedExecutor(object): raise ValueError('if `eval_metric_fn` is specified, ' 'eval_metric_fn must be a callable.') - old_phrase = tf.keras.backend.learning_phase() + old_phase = tf.keras.backend.learning_phase() tf.keras.backend.set_learning_phase(0) params = self._params strategy = self._strategy @@ -698,7 +699,8 @@ class DistributedExecutor(object): logging.info( 'Checkpoint file %s found and restoring from ' 'checkpoint', checkpoint_path) - checkpoint.restore(checkpoint_path) + status = checkpoint.restore(checkpoint_path) + status.expect_partial().assert_existing_objects_matched() self.global_train_step = model.optimizer.iterations eval_iterator = self._get_input_iterator(eval_input_fn, strategy) @@ -709,7 +711,7 @@ class DistributedExecutor(object): summary_writer(metrics=eval_metric_result, step=current_step) reset_states(eval_metric) - tf.keras.backend.set_learning_phase(old_phrase) + tf.keras.backend.set_learning_phase(old_phase) return eval_metric_result, current_step def predict(self): @@ -759,7 +761,7 @@ class ExecutorBuilder(object): Args: strategy_type: string. One of 'tpu', 'mirrored', 'multi_worker_mirrored'. - If None. User is responsible to set the strategy before calling + If None, the user is responsible for setting the strategy before calling build_executor(...). strategy_config: necessary config for constructing the proper Strategy. Check strategy_flags_dict() for examples of the structure. 
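The `safe_mean` helper added to `tf_utils.py` above computes `sum(losses) / size(losses)` through `tf.math.divide_no_nan`, so an empty loss tensor yields 0 instead of NaN. A plain-Python sketch of the same behavior (an illustration of the idea, not the TF implementation):

```python
def safe_mean(losses):
    """Mean of `losses` that returns 0.0 for an empty input instead of NaN.

    Mirrors the divide_no_nan semantics of the TF helper: total / count,
    with a zero denominator mapped to 0.0.
    """
    total = sum(losses)
    count = len(losses)
    return total / count if count else 0.0

mean = safe_mean([1.0, 2.0, 6.0])  # (1 + 2 + 6) / 3 = 3.0
empty = safe_mean([])              # 0.0, no ZeroDivisionError
```

The empty-input case matters when a replica receives a batch with no valid examples; the loss then contributes 0 to the cross-replica reduction instead of poisoning it with NaN.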
diff --git a/official/nlp/albert/run_classifier.py b/official/nlp/albert/run_classifier.py index fe72ff880f61c99e304bf089ef4ed0d75bfc349b..7b1371cc052775d3182c51a36926add43dee416e 100644 --- a/official/nlp/albert/run_classifier.py +++ b/official/nlp/albert/run_classifier.py @@ -14,23 +14,61 @@ # ============================================================================== """ALBERT classification finetuning runner in tf2.x.""" + from __future__ import absolute_import from __future__ import division from __future__ import print_function import json - +import os from absl import app from absl import flags +from absl import logging import tensorflow as tf from official.nlp.albert import configs as albert_configs +from official.nlp.bert import bert_models from official.nlp.bert import run_classifier as run_classifier_bert from official.utils.misc import distribution_utils + FLAGS = flags.FLAGS +def predict(strategy, albert_config, input_meta_data, predict_input_fn): + """Outputs both the ground truth and predictions as .tsv files.""" + with strategy.scope(): + classifier_model = bert_models.classifier_model( + albert_config, input_meta_data['num_labels'])[0] + checkpoint = tf.train.Checkpoint(model=classifier_model) + latest_checkpoint_file = ( + FLAGS.predict_checkpoint_path or + tf.train.latest_checkpoint(FLAGS.model_dir)) + assert latest_checkpoint_file + logging.info('Checkpoint file %s found and restoring from ' + 'checkpoint', latest_checkpoint_file) + checkpoint.restore( + latest_checkpoint_file).assert_existing_objects_matched() + preds, ground_truth = run_classifier_bert.get_predictions_and_labels( + strategy, classifier_model, predict_input_fn, return_probs=True) + output_predict_file = os.path.join(FLAGS.model_dir, 'test_results.tsv') + with tf.io.gfile.GFile(output_predict_file, 'w') as writer: + logging.info('***** Predict results *****') + for probabilities in preds: + output_line = '\t'.join( + str(class_probability) + for class_probability in 
probabilities) + '\n' + writer.write(output_line) + ground_truth_labels_file = os.path.join(FLAGS.model_dir, + 'output_labels.tsv') + with tf.io.gfile.GFile(ground_truth_labels_file, 'w') as writer: + logging.info('***** Ground truth results *****') + for label in ground_truth: + output_line = str(label) + '\n' + writer.write(output_line) + return + + + def main(_): with tf.io.gfile.GFile(FLAGS.input_meta_data_path, 'rb') as reader: input_meta_data = json.loads(reader.read().decode('utf-8')) @@ -56,9 +94,14 @@ def main(_): albert_config = albert_configs.AlbertConfig.from_json_file( FLAGS.bert_config_file) - run_classifier_bert.run_bert(strategy, input_meta_data, albert_config, - train_input_fn, eval_input_fn) - + if FLAGS.mode == 'train_and_eval': + run_classifier_bert.run_bert(strategy, input_meta_data, albert_config, + train_input_fn, eval_input_fn) + elif FLAGS.mode == 'predict': + predict(strategy, albert_config, input_meta_data, eval_input_fn) + else: + raise ValueError('Unsupported mode: %s' % FLAGS.mode) + return if __name__ == '__main__': flags.mark_flag_as_required('bert_config_file') diff --git a/official/nlp/albert/tf2_albert_encoder_checkpoint_converter.py b/official/nlp/albert/tf2_albert_encoder_checkpoint_converter.py index 402bc1445bed575362598d09212d14d03b629179..afd2ab19d6af157a24cf691b57c209d3dfd5f1fe 100644 --- a/official/nlp/albert/tf2_albert_encoder_checkpoint_converter.py +++ b/official/nlp/albert/tf2_albert_encoder_checkpoint_converter.py @@ -86,7 +86,7 @@ def _create_albert_model(cfg): activation=activations.gelu, dropout_rate=cfg.hidden_dropout_prob, attention_dropout_rate=cfg.attention_probs_dropout_prob, - sequence_length=cfg.max_position_embeddings, + max_sequence_length=cfg.max_position_embeddings, type_vocab_size=cfg.type_vocab_size, initializer=tf.keras.initializers.TruncatedNormal( stddev=cfg.initializer_range)) diff --git a/official/nlp/bert/bert_models.py b/official/nlp/bert/bert_models.py index 
e26c2a0caa0e0a3fe4881df137e1016614a39137..807c96581dce4118afae365364acae2b12f6415b 100644 --- a/official/nlp/bert/bert_models.py +++ b/official/nlp/bert/bert_models.py @@ -25,7 +25,6 @@ import tensorflow_hub as hub from official.modeling import tf_utils from official.nlp.albert import configs as albert_configs from official.nlp.bert import configs -from official.nlp.modeling import losses from official.nlp.modeling import models from official.nlp.modeling import networks @@ -67,22 +66,27 @@ class BertPretrainLossAndMetricLayer(tf.keras.layers.Layer): next_sentence_loss, name='next_sentence_loss', aggregation='mean') def call(self, - lm_output, - sentence_output, + lm_output_logits, + sentence_output_logits, lm_label_ids, lm_label_weights, sentence_labels=None): """Implements call() for the layer.""" lm_label_weights = tf.cast(lm_label_weights, tf.float32) - lm_output = tf.cast(lm_output, tf.float32) + lm_output_logits = tf.cast(lm_output_logits, tf.float32) - mask_label_loss = losses.weighted_sparse_categorical_crossentropy_loss( - labels=lm_label_ids, predictions=lm_output, weights=lm_label_weights) + lm_prediction_losses = tf.keras.losses.sparse_categorical_crossentropy( + lm_label_ids, lm_output_logits, from_logits=True) + lm_numerator_loss = tf.reduce_sum(lm_prediction_losses * lm_label_weights) + lm_denominator_loss = tf.reduce_sum(lm_label_weights) + mask_label_loss = tf.math.divide_no_nan(lm_numerator_loss, + lm_denominator_loss) if sentence_labels is not None: - sentence_output = tf.cast(sentence_output, tf.float32) - sentence_loss = losses.weighted_sparse_categorical_crossentropy_loss( - labels=sentence_labels, predictions=sentence_output) + sentence_output_logits = tf.cast(sentence_output_logits, tf.float32) + sentence_loss = tf.keras.losses.sparse_categorical_crossentropy( + sentence_labels, sentence_output_logits, from_logits=True) + sentence_loss = tf.reduce_mean(sentence_loss) loss = mask_label_loss + sentence_loss else: sentence_loss = None @@ -92,22 
+96,22 @@ class BertPretrainLossAndMetricLayer(tf.keras.layers.Layer): # TODO(hongkuny): Avoids the hack and switches add_loss. final_loss = tf.fill(batch_shape, loss) - self._add_metrics(lm_output, lm_label_ids, lm_label_weights, - mask_label_loss, sentence_output, sentence_labels, + self._add_metrics(lm_output_logits, lm_label_ids, lm_label_weights, + mask_label_loss, sentence_output_logits, sentence_labels, sentence_loss) return final_loss @gin.configurable def get_transformer_encoder(bert_config, - sequence_length, + sequence_length=None, transformer_encoder_cls=None, output_range=None): """Gets a 'TransformerEncoder' object. Args: bert_config: A 'modeling.BertConfig' or 'modeling.AlbertConfig' object. - sequence_length: Maximum sequence length of the training data. + sequence_length: [Deprecated]. transformer_encoder_cls: A EncoderScaffold class. If it is None, uses the default BERT encoder implementation. output_range: the sequence output range, [0, output_range). Default setting @@ -116,13 +120,13 @@ def get_transformer_encoder(bert_config, Returns: A networks.TransformerEncoder object. """ + del sequence_length if transformer_encoder_cls is not None: # TODO(hongkuny): evaluate if it is better to put cfg definition in gin. 
embedding_cfg = dict( vocab_size=bert_config.vocab_size, type_vocab_size=bert_config.type_vocab_size, hidden_size=bert_config.hidden_size, - seq_length=sequence_length, max_seq_length=bert_config.max_position_embeddings, initializer=tf.keras.initializers.TruncatedNormal( stddev=bert_config.initializer_range), @@ -157,7 +161,6 @@ def get_transformer_encoder(bert_config, activation=tf_utils.get_activation(bert_config.hidden_act), dropout_rate=bert_config.hidden_dropout_prob, attention_dropout_rate=bert_config.attention_probs_dropout_prob, - sequence_length=sequence_length, max_sequence_length=bert_config.max_position_embeddings, type_vocab_size=bert_config.type_vocab_size, embedding_width=bert_config.embedding_size, @@ -228,7 +231,7 @@ def pretrain_model(bert_config, activation=tf_utils.get_activation(bert_config.hidden_act), num_token_predictions=max_predictions_per_seq, initializer=initializer, - output='predictions') + output='logits') outputs = pretrainer_model( [input_word_ids, input_mask, input_type_ids, masked_lm_positions]) diff --git a/official/nlp/bert/bert_models_test.py b/official/nlp/bert/bert_models_test.py index 93763b45bfc53c5d32de2df7f7f0f72894e9556f..0c6e3ec43b55db1bd3a53754cf176c0db8cfadf1 100644 --- a/official/nlp/bert/bert_models_test.py +++ b/official/nlp/bert/bert_models_test.py @@ -56,8 +56,6 @@ class BertModelsTest(tf.test.TestCase): # Expect two output from encoder: sequence and classification output. 
self.assertIsInstance(encoder.output, list) self.assertLen(encoder.output, 2) - # shape should be [batch size, seq_length, hidden_size] - self.assertEqual(encoder.output[0].shape.as_list(), [None, 5, 16]) # shape should be [batch size, hidden_size] self.assertEqual(encoder.output[1].shape.as_list(), [None, 16]) @@ -74,16 +72,12 @@ class BertModelsTest(tf.test.TestCase): # Expect two output from model: start positions and end positions self.assertIsInstance(model.output, list) self.assertLen(model.output, 2) - # shape should be [batch size, seq_length] - self.assertEqual(model.output[0].shape.as_list(), [None, 5]) - # shape should be [batch size, seq_length] - self.assertEqual(model.output[1].shape.as_list(), [None, 5]) # Expect two output from core_model: sequence and classification output. self.assertIsInstance(core_model.output, list) self.assertLen(core_model.output, 2) - # shape should be [batch size, seq_length, hidden_size] - self.assertEqual(core_model.output[0].shape.as_list(), [None, 5, 16]) + # shape should be [batch size, None, hidden_size] + self.assertEqual(core_model.output[0].shape.as_list(), [None, None, 16]) # shape should be [batch size, hidden_size] self.assertEqual(core_model.output[1].shape.as_list(), [None, 16]) @@ -104,8 +98,8 @@ class BertModelsTest(tf.test.TestCase): # Expect two output from core_model: sequence and classification output. 
self.assertIsInstance(core_model.output, list) self.assertLen(core_model.output, 2) - # shape should be [batch size, 1, hidden_size] - self.assertEqual(core_model.output[0].shape.as_list(), [None, 1, 16]) + # shape should be [batch size, None, hidden_size] + self.assertEqual(core_model.output[0].shape.as_list(), [None, None, 16]) # shape should be [batch size, hidden_size] self.assertEqual(core_model.output[1].shape.as_list(), [None, 16]) diff --git a/official/nlp/bert/export_tfhub.py b/official/nlp/bert/export_tfhub.py index 5923309d1fa36a16d4cccda11650d9c3d0fcc616..5a49a3df54a64ceacbe1235b870d17bc84d8a488 100644 --- a/official/nlp/bert/export_tfhub.py +++ b/official/nlp/bert/export_tfhub.py @@ -79,7 +79,7 @@ def export_bert_tfhub(bert_config: configs.BertConfig, do_lower_case, vocab_file) core_model, encoder = create_bert_model(bert_config) checkpoint = tf.train.Checkpoint(model=encoder) - checkpoint.restore(model_checkpoint_path).assert_consumed() + checkpoint.restore(model_checkpoint_path).assert_existing_objects_matched() core_model.vocab_file = tf.saved_model.Asset(vocab_file) core_model.do_lower_case = tf.Variable(do_lower_case, trainable=False) core_model.save(hub_destination, include_optimizer=False, save_format="tf") diff --git a/official/nlp/bert/input_pipeline.py b/official/nlp/bert/input_pipeline.py index 73c2a096ef6cf71b64929f78d5fdee33b9a8692f..ed3fd173d4379a75ab1e2e5a9ba0bbdcbaa0be42 100644 --- a/official/nlp/bert/input_pipeline.py +++ b/official/nlp/bert/input_pipeline.py @@ -247,3 +247,39 @@ def create_squad_dataset(file_path, dataset = dataset.batch(batch_size, drop_remainder=True) dataset = dataset.prefetch(tf.data.experimental.AUTOTUNE) return dataset + + +def create_retrieval_dataset(file_path, + seq_length, + batch_size, + input_pipeline_context=None): + """Creates input dataset from (tf)records files for scoring.""" + name_to_features = { + 'input_ids': tf.io.FixedLenFeature([seq_length], tf.int64), + 'input_mask': 
tf.io.FixedLenFeature([seq_length], tf.int64), + 'segment_ids': tf.io.FixedLenFeature([seq_length], tf.int64), + 'int_iden': tf.io.FixedLenFeature([1], tf.int64), + } + dataset = single_file_dataset(file_path, name_to_features) + + # The dataset is always sharded by number of hosts. + # num_input_pipelines is the number of hosts rather than number of cores. + if input_pipeline_context and input_pipeline_context.num_input_pipelines > 1: + dataset = dataset.shard(input_pipeline_context.num_input_pipelines, + input_pipeline_context.input_pipeline_id) + + def _select_data_from_record(record): + x = { + 'input_word_ids': record['input_ids'], + 'input_mask': record['input_mask'], + 'input_type_ids': record['segment_ids'] + } + y = record['int_iden'] + return (x, y) + + dataset = dataset.map( + _select_data_from_record, + num_parallel_calls=tf.data.experimental.AUTOTUNE) + dataset = dataset.batch(batch_size, drop_remainder=False) + dataset = dataset.prefetch(tf.data.experimental.AUTOTUNE) + return dataset diff --git a/official/nlp/bert/model_saving_utils.py b/official/nlp/bert/model_saving_utils.py index 13d2c9ed02f9a98d9dcbb2a60c46fa5cd13bb666..24e39c6e4af02757d81dcc380612148da5891ac5 100644 --- a/official/nlp/bert/model_saving_utils.py +++ b/official/nlp/bert/model_saving_utils.py @@ -55,14 +55,10 @@ def export_bert_model(model_export_path: typing.Text, raise ValueError('model must be a tf.keras.Model object.') if checkpoint_dir: - # Keras compile/fit() was used to save checkpoint using - # model.save_weights(). if restore_model_using_load_weights: model_weight_path = os.path.join(checkpoint_dir, 'checkpoint') assert tf.io.gfile.exists(model_weight_path) model.load_weights(model_weight_path) - - # tf.train.Checkpoint API was used via custom training loop logic. 
else: checkpoint = tf.train.Checkpoint(model=model) diff --git a/official/nlp/bert/model_training_utils.py b/official/nlp/bert/model_training_utils.py index f0fe67615726906a6b1d3ef38a5ca9acfe8502de..071e18b3453a7291fd4ece111811ac1e1243a5cd 100644 --- a/official/nlp/bert/model_training_utils.py +++ b/official/nlp/bert/model_training_utils.py @@ -99,7 +99,9 @@ def write_txt_summary(training_summary, summary_dir): @deprecation.deprecated( - None, 'This function is deprecated. Please use Keras compile/fit instead.') + None, 'This function is deprecated and no new functionality will ' + 'be added to it. Please avoid making new code depend ' + 'on this library.') def run_customized_training_loop( # pylint: disable=invalid-name _sentinel=None, @@ -557,7 +559,6 @@ def run_customized_training_loop( for metric in model.metrics: training_summary[metric.name] = _float_metric_value(metric) if eval_metrics: - # TODO(hongkuny): Cleans up summary reporting in text. training_summary['last_train_metrics'] = _float_metric_value( train_metrics[0]) training_summary['eval_metrics'] = _float_metric_value(eval_metrics[0]) diff --git a/official/nlp/bert/run_classifier.py b/official/nlp/bert/run_classifier.py index e2eb525ae4335091c78eb4ead72494f8021a7f89..c5f3721ada6279f7446ec0d21ce1eeae549afcd8 100644 --- a/official/nlp/bert/run_classifier.py +++ b/official/nlp/bert/run_classifier.py @@ -343,7 +343,10 @@ def export_classifier(model_export_path, input_meta_data, bert_config, # Export uses float32 for now, even if training uses mixed precision. 
tf.keras.mixed_precision.experimental.set_policy('float32') classifier_model = bert_models.classifier_model( - bert_config, input_meta_data.get('num_labels', 1))[0] + bert_config, + input_meta_data.get('num_labels', 1), + hub_module_url=FLAGS.hub_module_url, + hub_module_trainable=False)[0] model_saving_utils.export_bert_model( model_export_path, model=classifier_model, checkpoint_dir=model_dir) diff --git a/official/nlp/bert/run_squad_helper.py b/official/nlp/bert/run_squad_helper.py index 7f6ea5bbbe2ae2fb6af89f139da989c82b1f893d..b03e356d91bdf6a9edf9486f505526852c6c7ef6 100644 --- a/official/nlp/bert/run_squad_helper.py +++ b/official/nlp/bert/run_squad_helper.py @@ -61,7 +61,11 @@ def define_common_squad_flags(): flags.DEFINE_integer('train_batch_size', 32, 'Total batch size for training.') # Predict processing related. flags.DEFINE_string('predict_file', None, - 'Prediction data path with train tfrecords.') + 'SQuAD prediction json file path. ' + '`predict` mode supports multiple files: use a ' + 'wildcard to match multiple files, or pass multiple ' + 'comma-separated file patterns. Note that ' + '`eval` mode only supports a single predict file.') flags.DEFINE_bool( 'do_lower_case', True, 'Whether to lower case the input text. 
Should be True for uncased ' @@ -159,22 +163,9 @@ def get_dataset_fn(input_file_pattern, max_seq_length, global_batch_size, return _dataset_fn -def predict_squad_customized(strategy, - input_meta_data, - bert_config, - checkpoint_path, - predict_tfrecord_path, - num_steps): - """Make predictions using a Bert-based squad model.""" - predict_dataset_fn = get_dataset_fn( - predict_tfrecord_path, - input_meta_data['max_seq_length'], - FLAGS.predict_batch_size, - is_training=False) - predict_iterator = iter( - strategy.experimental_distribute_datasets_from_function( - predict_dataset_fn)) - +def get_squad_model_to_predict(strategy, bert_config, checkpoint_path, + input_meta_data): + """Gets a squad model to make predictions.""" with strategy.scope(): # Prediction always uses float32, even if training uses mixed precision. tf.keras.mixed_precision.experimental.set_policy('float32') @@ -188,6 +179,23 @@ def predict_squad_customized(strategy, logging.info('Restoring checkpoints from %s', checkpoint_path) checkpoint = tf.train.Checkpoint(model=squad_model) checkpoint.restore(checkpoint_path).expect_partial() + return squad_model + + +def predict_squad_customized(strategy, + input_meta_data, + predict_tfrecord_path, + num_steps, + squad_model): + """Make predictions using a Bert-based squad model.""" + predict_dataset_fn = get_dataset_fn( + predict_tfrecord_path, + input_meta_data['max_seq_length'], + FLAGS.predict_batch_size, + is_training=False) + predict_iterator = iter( + strategy.experimental_distribute_datasets_from_function( + predict_dataset_fn)) @tf.function def predict_step(iterator): @@ -287,8 +295,8 @@ def train_squad(strategy, post_allreduce_callbacks=[clip_by_global_norm_callback]) -def prediction_output_squad( - strategy, input_meta_data, tokenizer, bert_config, squad_lib, checkpoint): +def prediction_output_squad(strategy, input_meta_data, tokenizer, squad_lib, + predict_file, squad_model): """Makes predictions for a squad dataset.""" doc_stride = 
input_meta_data['doc_stride'] max_query_length = input_meta_data['max_query_length'] @@ -296,7 +304,7 @@ version_2_with_negative = input_meta_data.get('version_2_with_negative', False) eval_examples = squad_lib.read_squad_examples( - input_file=FLAGS.predict_file, + input_file=predict_file, is_training=False, version_2_with_negative=version_2_with_negative) @@ -337,8 +345,7 @@ num_steps = int(dataset_size / FLAGS.predict_batch_size) all_results = predict_squad_customized( - strategy, input_meta_data, bert_config, - checkpoint, eval_writer.filename, num_steps) + strategy, input_meta_data, eval_writer.filename, num_steps, squad_model) all_predictions, all_nbest_json, scores_diff_json = ( squad_lib.postprocess_output( @@ -356,11 +363,14 @@ def dump_to_files(all_predictions, all_nbest_json, scores_diff_json, - squad_lib, version_2_with_negative): + squad_lib, version_2_with_negative, file_prefix=''): """Save output to json files.""" - output_prediction_file = os.path.join(FLAGS.model_dir, 'predictions.json') - output_nbest_file = os.path.join(FLAGS.model_dir, 'nbest_predictions.json') - output_null_log_odds_file = os.path.join(FLAGS.model_dir, 'null_odds.json') + output_prediction_file = os.path.join(FLAGS.model_dir, + '%spredictions.json' % file_prefix) + output_nbest_file = os.path.join(FLAGS.model_dir, + '%snbest_predictions.json' % file_prefix) + output_null_log_odds_file = os.path.join(FLAGS.model_dir, + '%snull_odds.json' % file_prefix) logging.info('Writing predictions to: %s', (output_prediction_file)) logging.info('Writing nbest to: %s', (output_nbest_file)) @@ -370,6 +380,22 @@ def dump_to_files(all_predictions, all_nbest_json, scores_diff_json, squad_lib.write_to_json_files(scores_diff_json, output_null_log_odds_file) +def _get_matched_files(input_path): + """Returns all files that match the input_path.""" + input_patterns = input_path.strip().split(',') + 
all_matched_files = [] + for input_pattern in input_patterns: + input_pattern = input_pattern.strip() + if not input_pattern: + continue + matched_files = tf.io.gfile.glob(input_pattern) + if not matched_files: + raise ValueError('%s does not match any files.' % input_pattern) + else: + all_matched_files.extend(matched_files) + return sorted(all_matched_files) + + + def predict_squad(strategy, input_meta_data, tokenizer, @@ -379,11 +405,24 @@ """Gets prediction results and writes them to disk.""" if init_checkpoint is None: init_checkpoint = tf.train.latest_checkpoint(FLAGS.model_dir) - all_predictions, all_nbest_json, scores_diff_json = prediction_output_squad( - strategy, input_meta_data, tokenizer, - bert_config, squad_lib, init_checkpoint) - dump_to_files(all_predictions, all_nbest_json, scores_diff_json, squad_lib, - input_meta_data.get('version_2_with_negative', False)) + + all_predict_files = _get_matched_files(FLAGS.predict_file) + squad_model = get_squad_model_to_predict(strategy, bert_config, + init_checkpoint, input_meta_data) + for idx, predict_file in enumerate(all_predict_files): + all_predictions, all_nbest_json, scores_diff_json = prediction_output_squad( + strategy, input_meta_data, tokenizer, squad_lib, predict_file, + squad_model) + if len(all_predict_files) == 1: + file_prefix = '' + else: + # if predict_file is /path/xquad.ar.json, the `file_prefix` will be + # "xquad.ar-" + file_prefix = '%s-' % os.path.splitext( + os.path.basename(all_predict_files[idx]))[0] + dump_to_files(all_predictions, all_nbest_json, scores_diff_json, squad_lib, + input_meta_data.get('version_2_with_negative', False), + file_prefix) def eval_squad(strategy, @@ -395,9 +434,17 @@ """Get prediction results and evaluate them against ground truth.""" if init_checkpoint is None: init_checkpoint = tf.train.latest_checkpoint(FLAGS.model_dir) + + all_predict_files = _get_matched_files(FLAGS.predict_file) + if 
len(all_predict_files) != 1: + raise ValueError('`eval_squad` only supports one predict file, ' + 'but got %s' % all_predict_files) + + squad_model = get_squad_model_to_predict(strategy, bert_config, + init_checkpoint, input_meta_data) all_predictions, all_nbest_json, scores_diff_json = prediction_output_squad( - strategy, input_meta_data, tokenizer, - bert_config, squad_lib, init_checkpoint) + strategy, input_meta_data, tokenizer, squad_lib, all_predict_files[0], + squad_model) dump_to_files(all_predictions, all_nbest_json, scores_diff_json, squad_lib, input_meta_data.get('version_2_with_negative', False)) diff --git a/official/nlp/bert/tf2_encoder_checkpoint_converter.py b/official/nlp/bert/tf2_encoder_checkpoint_converter.py index 2faf6ea2cfb9f0d71d0a79dff101e0408fa41778..835a152f7ca54c32200b2aed6481a546cab366dc 100644 --- a/official/nlp/bert/tf2_encoder_checkpoint_converter.py +++ b/official/nlp/bert/tf2_encoder_checkpoint_converter.py @@ -61,7 +61,7 @@ def _create_bert_model(cfg): activation=activations.gelu, dropout_rate=cfg.hidden_dropout_prob, attention_dropout_rate=cfg.attention_probs_dropout_prob, - sequence_length=cfg.max_position_embeddings, + max_sequence_length=cfg.max_position_embeddings, type_vocab_size=cfg.type_vocab_size, initializer=tf.keras.initializers.TruncatedNormal( stddev=cfg.initializer_range), @@ -73,6 +73,7 @@ def _create_bert_model(cfg): def convert_checkpoint(bert_config, output_path, v1_checkpoint): """Converts a V1 checkpoint into an OO V2 checkpoint.""" output_dir, _ = os.path.split(output_path) + tf.io.gfile.makedirs(output_dir) # Create a temporary V1 name-converted checkpoint in the output directory. 
temporary_checkpoint_dir = os.path.join(output_dir, "temp_v1") diff --git a/official/nlp/configs/bert.py b/official/nlp/configs/bert.py index 058af898f51c99ccf35114b5bff480995b8a580d..fad49e29debd0864448b00899725b55101c8f293 100644 --- a/official/nlp/configs/bert.py +++ b/official/nlp/configs/bert.py @@ -13,7 +13,10 @@ # See the License for the specific language governing permissions and # limitations under the License. # ============================================================================== -"""A multi-head BERT encoder network for pretraining.""" +"""Multi-head BERT encoder network with classification heads. + +Includes configurations and instantiation methods. +""" from typing import List, Optional, Text import dataclasses @@ -21,10 +24,8 @@ import tensorflow as tf from official.modeling import tf_utils from official.modeling.hyperparams import base_config -from official.modeling.hyperparams import config_definitions as cfg from official.nlp.configs import encoders from official.nlp.modeling import layers -from official.nlp.modeling import networks from official.nlp.modeling.models import bert_pretrainer @@ -41,80 +42,30 @@ class ClsHeadConfig(base_config.Config): @dataclasses.dataclass class BertPretrainerConfig(base_config.Config): """BERT encoder configuration.""" - num_masked_tokens: int = 76 encoder: encoders.TransformerEncoderConfig = ( encoders.TransformerEncoderConfig()) cls_heads: List[ClsHeadConfig] = dataclasses.field(default_factory=list) -def instantiate_from_cfg( +def instantiate_classification_heads_from_cfgs( + cls_head_configs: List[ClsHeadConfig]) -> List[layers.ClassificationHead]: + return [ + layers.ClassificationHead(**cfg.as_dict()) for cfg in cls_head_configs + ] if cls_head_configs else [] + + +def instantiate_pretrainer_from_cfg( config: BertPretrainerConfig, - encoder_network: Optional[tf.keras.Model] = None): + encoder_network: Optional[tf.keras.Model] = None +) -> bert_pretrainer.BertPretrainerV2: """Instantiates a 
BertPretrainer from the config.""" encoder_cfg = config.encoder if encoder_network is None: - encoder_network = networks.TransformerEncoder( - vocab_size=encoder_cfg.vocab_size, - hidden_size=encoder_cfg.hidden_size, - num_layers=encoder_cfg.num_layers, - num_attention_heads=encoder_cfg.num_attention_heads, - intermediate_size=encoder_cfg.intermediate_size, - activation=tf_utils.get_activation(encoder_cfg.hidden_activation), - dropout_rate=encoder_cfg.dropout_rate, - attention_dropout_rate=encoder_cfg.attention_dropout_rate, - max_sequence_length=encoder_cfg.max_position_embeddings, - type_vocab_size=encoder_cfg.type_vocab_size, - initializer=tf.keras.initializers.TruncatedNormal( - stddev=encoder_cfg.initializer_range)) - if config.cls_heads: - classification_heads = [ - layers.ClassificationHead(**cfg.as_dict()) for cfg in config.cls_heads - ] - else: - classification_heads = [] + encoder_network = encoders.instantiate_encoder_from_cfg(encoder_cfg) return bert_pretrainer.BertPretrainerV2( - config.num_masked_tokens, mlm_activation=tf_utils.get_activation(encoder_cfg.hidden_activation), mlm_initializer=tf.keras.initializers.TruncatedNormal( stddev=encoder_cfg.initializer_range), encoder_network=encoder_network, - classification_heads=classification_heads) - - -@dataclasses.dataclass -class BertPretrainDataConfig(cfg.DataConfig): - """Data config for BERT pretraining task.""" - input_path: str = "" - global_batch_size: int = 512 - is_training: bool = True - seq_length: int = 512 - max_predictions_per_seq: int = 76 - use_next_sentence_label: bool = True - use_position_id: bool = False - - -@dataclasses.dataclass -class BertPretrainEvalDataConfig(BertPretrainDataConfig): - """Data config for the eval set in BERT pretraining task.""" - input_path: str = "" - global_batch_size: int = 512 - is_training: bool = False - - -@dataclasses.dataclass -class BertSentencePredictionDataConfig(cfg.DataConfig): - """Data of sentence prediction dataset.""" - input_path: str = "" - 
global_batch_size: int = 32 - is_training: bool = True - seq_length: int = 128 - - -@dataclasses.dataclass -class BertSentencePredictionDevDataConfig(cfg.DataConfig): - """Dev data of MNLI sentence prediction dataset.""" - input_path: str = "" - global_batch_size: int = 32 - is_training: bool = False - seq_length: int = 128 - drop_remainder: bool = False + classification_heads=instantiate_classification_heads_from_cfgs( + config.cls_heads)) diff --git a/official/nlp/configs/bert_test.py b/official/nlp/configs/bert_test.py index 199608cd05ab6a83d92edbcf5154aa7b33c8dfd0..871ab45373c430667f2cf45f93492947aaa3c4e9 100644 --- a/official/nlp/configs/bert_test.py +++ b/official/nlp/configs/bert_test.py @@ -26,7 +26,7 @@ class BertModelsTest(tf.test.TestCase): def test_network_invocation(self): config = bert.BertPretrainerConfig( encoder=encoders.TransformerEncoderConfig(vocab_size=10, num_layers=1)) - _ = bert.instantiate_from_cfg(config) + _ = bert.instantiate_pretrainer_from_cfg(config) # Invokes with classification heads. 
config = bert.BertPretrainerConfig( @@ -35,7 +35,7 @@ class BertModelsTest(tf.test.TestCase): bert.ClsHeadConfig( inner_dim=10, num_classes=2, name="next_sentence") ]) - _ = bert.instantiate_from_cfg(config) + _ = bert.instantiate_pretrainer_from_cfg(config) with self.assertRaises(ValueError): config = bert.BertPretrainerConfig( @@ -47,7 +47,7 @@ class BertModelsTest(tf.test.TestCase): bert.ClsHeadConfig( inner_dim=10, num_classes=2, name="next_sentence") ]) - _ = bert.instantiate_from_cfg(config) + _ = bert.instantiate_pretrainer_from_cfg(config) def test_checkpoint_items(self): config = bert.BertPretrainerConfig( @@ -56,9 +56,10 @@ class BertModelsTest(tf.test.TestCase): bert.ClsHeadConfig( inner_dim=10, num_classes=2, name="next_sentence") ]) - encoder = bert.instantiate_from_cfg(config) - self.assertSameElements(encoder.checkpoint_items.keys(), - ["encoder", "next_sentence.pooler_dense"]) + encoder = bert.instantiate_pretrainer_from_cfg(config) + self.assertSameElements( + encoder.checkpoint_items.keys(), + ["encoder", "masked_lm", "next_sentence.pooler_dense"]) if __name__ == "__main__": diff --git a/official/nlp/configs/electra.py b/official/nlp/configs/electra.py new file mode 100644 index 0000000000000000000000000000000000000000..61fd82db702364ffe6baf8fad1c8b3ae17d09120 --- /dev/null +++ b/official/nlp/configs/electra.py @@ -0,0 +1,91 @@ +# Lint as: python3 +# Copyright 2020 The TensorFlow Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. 
+# See the License for the specific language governing permissions and +# limitations under the License. +# ============================================================================== +"""ELECTRA model configurations and instantiation methods.""" +from typing import List, Optional + +import dataclasses +import tensorflow as tf + +from official.modeling import tf_utils +from official.modeling.hyperparams import base_config +from official.nlp.configs import bert +from official.nlp.configs import encoders +from official.nlp.modeling import layers +from official.nlp.modeling.models import electra_pretrainer + + +@dataclasses.dataclass +class ELECTRAPretrainerConfig(base_config.Config): + """ELECTRA pretrainer configuration.""" + num_masked_tokens: int = 76 + sequence_length: int = 512 + num_classes: int = 2 + discriminator_loss_weight: float = 50.0 + tie_embeddings: bool = True + disallow_correct: bool = False + generator_encoder: encoders.TransformerEncoderConfig = ( + encoders.TransformerEncoderConfig()) + discriminator_encoder: encoders.TransformerEncoderConfig = ( + encoders.TransformerEncoderConfig()) + cls_heads: List[bert.ClsHeadConfig] = dataclasses.field(default_factory=list) + + +def instantiate_classification_heads_from_cfgs( + cls_head_configs: List[bert.ClsHeadConfig] +) -> List[layers.ClassificationHead]: + if cls_head_configs: + return [ + layers.ClassificationHead(**cfg.as_dict()) for cfg in cls_head_configs + ] + else: + return [] + + +def instantiate_pretrainer_from_cfg( + config: ELECTRAPretrainerConfig, + generator_network: Optional[tf.keras.Model] = None, + discriminator_network: Optional[tf.keras.Model] = None, + ) -> electra_pretrainer.ElectraPretrainer: + """Instantiates ElectraPretrainer from the config.""" + generator_encoder_cfg = config.generator_encoder + discriminator_encoder_cfg = config.discriminator_encoder + # Copy discriminator's embeddings to generator for easier model serialization. 
+ if discriminator_network is None: + discriminator_network = encoders.instantiate_encoder_from_cfg( + discriminator_encoder_cfg) + if generator_network is None: + if config.tie_embeddings: + embedding_layer = discriminator_network.get_embedding_layer() + generator_network = encoders.instantiate_encoder_from_cfg( + generator_encoder_cfg, embedding_layer=embedding_layer) + else: + generator_network = encoders.instantiate_encoder_from_cfg( + generator_encoder_cfg) + + return electra_pretrainer.ElectraPretrainer( + generator_network=generator_network, + discriminator_network=discriminator_network, + vocab_size=config.generator_encoder.vocab_size, + num_classes=config.num_classes, + sequence_length=config.sequence_length, + num_token_predictions=config.num_masked_tokens, + mlm_activation=tf_utils.get_activation( + generator_encoder_cfg.hidden_activation), + mlm_initializer=tf.keras.initializers.TruncatedNormal( + stddev=generator_encoder_cfg.initializer_range), + classification_heads=instantiate_classification_heads_from_cfgs( + config.cls_heads), + disallow_correct=config.disallow_correct) diff --git a/official/nlp/configs/electra_test.py b/official/nlp/configs/electra_test.py new file mode 100644 index 0000000000000000000000000000000000000000..d06d64a95d6ef987cdb34a471521853001f11339 --- /dev/null +++ b/official/nlp/configs/electra_test.py @@ -0,0 +1,49 @@ +# Lint as: python3 +# Copyright 2020 The TensorFlow Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. 
+# See the License for the specific language governing permissions and
+# limitations under the License.
+# ==============================================================================
+"""Tests for ELECTRA configurations and models instantiation."""
+
+import tensorflow as tf
+
+from official.nlp.configs import bert
+from official.nlp.configs import electra
+from official.nlp.configs import encoders
+
+
+class ELECTRAModelsTest(tf.test.TestCase):
+
+  def test_network_invocation(self):
+    config = electra.ELECTRAPretrainerConfig(
+        generator_encoder=encoders.TransformerEncoderConfig(
+            vocab_size=10, num_layers=1),
+        discriminator_encoder=encoders.TransformerEncoderConfig(
+            vocab_size=10, num_layers=2),
+    )
+    _ = electra.instantiate_pretrainer_from_cfg(config)
+
+    # Invokes with classification heads.
+    config = electra.ELECTRAPretrainerConfig(
+        generator_encoder=encoders.TransformerEncoderConfig(
+            vocab_size=10, num_layers=1),
+        discriminator_encoder=encoders.TransformerEncoderConfig(
+            vocab_size=10, num_layers=2),
+        cls_heads=[
+            bert.ClsHeadConfig(
+                inner_dim=10, num_classes=2, name="next_sentence")
+        ])
+    _ = electra.instantiate_pretrainer_from_cfg(config)
+
+if __name__ == "__main__":
+  tf.test.main()
diff --git a/official/nlp/configs/encoders.py b/official/nlp/configs/encoders.py
index 146879a9552fb8177734f7eebb4e49437cfb4d3e..b7467634a36adf72952481faacbce4852cd7feb7 100644
--- a/official/nlp/configs/encoders.py
+++ b/official/nlp/configs/encoders.py
@@ -13,11 +13,18 @@
 # See the License for the specific language governing permissions and
 # limitations under the License.
 # ==============================================================================
-"""Configurations for Encoders."""
+"""Transformer Encoders.
+Includes configurations and instantiation methods.
+"""
+from typing import Optional
 import dataclasses
+import tensorflow as tf
+from official.modeling import tf_utils
 from official.modeling.hyperparams import base_config
+from official.nlp.modeling import layers
+from official.nlp.modeling import networks
 
 
 @dataclasses.dataclass
@@ -28,9 +35,64 @@ class TransformerEncoderConfig(base_config.Config):
   num_layers: int = 12
   num_attention_heads: int = 12
   hidden_activation: str = "gelu"
-  intermediate_size: int = 3076
+  intermediate_size: int = 3072
   dropout_rate: float = 0.1
   attention_dropout_rate: float = 0.1
   max_position_embeddings: int = 512
   type_vocab_size: int = 2
   initializer_range: float = 0.02
+  embedding_size: Optional[int] = None
+
+
+def instantiate_encoder_from_cfg(
+    config: TransformerEncoderConfig,
+    encoder_cls=networks.TransformerEncoder,
+    embedding_layer: Optional[layers.OnDeviceEmbedding] = None):
+  """Instantiate a Transformer encoder network from TransformerEncoderConfig."""
+  if encoder_cls.__name__ == "EncoderScaffold":
+    embedding_cfg = dict(
+        vocab_size=config.vocab_size,
+        type_vocab_size=config.type_vocab_size,
+        hidden_size=config.hidden_size,
+        max_seq_length=config.max_position_embeddings,
+        initializer=tf.keras.initializers.TruncatedNormal(
+            stddev=config.initializer_range),
+        dropout_rate=config.dropout_rate,
+    )
+    hidden_cfg = dict(
+        num_attention_heads=config.num_attention_heads,
+        intermediate_size=config.intermediate_size,
+        intermediate_activation=tf_utils.get_activation(
+            config.hidden_activation),
+        dropout_rate=config.dropout_rate,
+        attention_dropout_rate=config.attention_dropout_rate,
+        kernel_initializer=tf.keras.initializers.TruncatedNormal(
+            stddev=config.initializer_range),
+    )
+    kwargs = dict(
+        embedding_cfg=embedding_cfg,
+        hidden_cfg=hidden_cfg,
+        num_hidden_instances=config.num_layers,
+        pooled_output_dim=config.hidden_size,
+        pooler_layer_initializer=tf.keras.initializers.TruncatedNormal(
+            stddev=config.initializer_range))
+    return encoder_cls(**kwargs)
+
+  if encoder_cls.__name__ != "TransformerEncoder":
+    raise ValueError("Unknown encoder network class. %s" % str(encoder_cls))
+  encoder_network = encoder_cls(
+      vocab_size=config.vocab_size,
+      hidden_size=config.hidden_size,
+      num_layers=config.num_layers,
+      num_attention_heads=config.num_attention_heads,
+      intermediate_size=config.intermediate_size,
+      activation=tf_utils.get_activation(config.hidden_activation),
+      dropout_rate=config.dropout_rate,
+      attention_dropout_rate=config.attention_dropout_rate,
+      max_sequence_length=config.max_position_embeddings,
+      type_vocab_size=config.type_vocab_size,
+      initializer=tf.keras.initializers.TruncatedNormal(
+          stddev=config.initializer_range),
+      embedding_width=config.embedding_size,
+      embedding_layer=embedding_layer)
+  return encoder_network
diff --git a/official/nlp/data/classifier_data_lib.py b/official/nlp/data/classifier_data_lib.py
index ce17edc1f4d83eb1fa2fb305303412b77384ff9b..09f5863c19156ef601197acdc1ab0b10fe2d699c 100644
--- a/official/nlp/data/classifier_data_lib.py
+++ b/official/nlp/data/classifier_data_lib.py
@@ -31,9 +31,15 @@ from official.nlp.bert import tokenization
 
 
 class InputExample(object):
-  """A single training/test example for simple sequence classification."""
+  """A single training/test example for simple seq regression/classification."""
 
-  def __init__(self, guid, text_a, text_b=None, label=None, weight=None):
+  def __init__(self,
+               guid,
+               text_a,
+               text_b=None,
+               label=None,
+               weight=None,
+               int_iden=None):
     """Constructs a InputExample.
 
     Args:
@@ -42,16 +48,20 @@ class InputExample(object):
         sequence tasks, only this sequence must be specified.
       text_b: (Optional) string. The untokenized text of the second sequence.
         Only must be specified for sequence pair tasks.
-      label: (Optional) string. The label of the example. This should be
-        specified for train and dev examples, but not for test examples.
+      label: (Optional) string for classification, float for regression. The
+        label of the example. This should be specified for train and dev
+        examples, but not for test examples.
       weight: (Optional) float. The weight of the example to be used during
         training.
+      int_iden: (Optional) int. The integer identification number of the
+        example in the corpus.
     """
     self.guid = guid
     self.text_a = text_a
     self.text_b = text_b
     self.label = label
     self.weight = weight
+    self.int_iden = int_iden
 
 
 class InputFeatures(object):
@@ -63,20 +73,24 @@ class InputFeatures(object):
                segment_ids,
                label_id,
                is_real_example=True,
-               weight=None):
+               weight=None,
+               int_iden=None):
     self.input_ids = input_ids
     self.input_mask = input_mask
     self.segment_ids = segment_ids
     self.label_id = label_id
     self.is_real_example = is_real_example
     self.weight = weight
+    self.int_iden = int_iden
 
 
 class DataProcessor(object):
-  """Base class for data converters for sequence classification data sets."""
+  """Base class for converters for seq regression/classification datasets."""
 
   def __init__(self, process_text_fn=tokenization.convert_to_unicode):
    self.process_text_fn = process_text_fn
+    self.is_regression = False
+    self.label_type = None
 
   def get_train_examples(self, data_dir):
     """Gets a collection of `InputExample`s for the train set."""
@@ -110,92 +124,163 @@ class DataProcessor(object):
     return lines
 
 
-class XnliProcessor(DataProcessor):
-  """Processor for the XNLI data set."""
-  supported_languages = [
-      "ar", "bg", "de", "el", "en", "es", "fr", "hi", "ru", "sw", "th", "tr",
-      "ur", "vi", "zh"
-  ]
-
-  def __init__(self,
-               language="en",
-               process_text_fn=tokenization.convert_to_unicode):
-    super(XnliProcessor, self).__init__(process_text_fn)
-    if language == "all":
-      self.languages = XnliProcessor.supported_languages
-    elif language not in XnliProcessor.supported_languages:
-      raise ValueError("language %s is not supported for XNLI task." % language)
-    else:
-      self.languages = [language]
+class ColaProcessor(DataProcessor):
+  """Processor for the CoLA data set (GLUE version)."""
 
   def get_train_examples(self, data_dir):
     """See base class."""
-    lines = []
-    for language in self.languages:
-      # Skips the header.
-      lines.extend(
-          self._read_tsv(
-              os.path.join(data_dir, "multinli",
                           "multinli.train.%s.tsv" % language))[1:])
+    return self._create_examples(
+        self._read_tsv(os.path.join(data_dir, "train.tsv")), "train")
+
+  def get_dev_examples(self, data_dir):
+    """See base class."""
+    return self._create_examples(
+        self._read_tsv(os.path.join(data_dir, "dev.tsv")), "dev")
+
+  def get_test_examples(self, data_dir):
+    """See base class."""
+    return self._create_examples(
+        self._read_tsv(os.path.join(data_dir, "test.tsv")), "test")
+
+  def get_labels(self):
+    """See base class."""
+    return ["0", "1"]
+
+  @staticmethod
+  def get_processor_name():
+    """See base class."""
+    return "COLA"
+
+  def _create_examples(self, lines, set_type):
+    """Creates examples for the training/dev/test sets."""
     examples = []
-    for (i, line) in enumerate(lines):
-      guid = "train-%d" % i
-      text_a = self.process_text_fn(line[0])
-      text_b = self.process_text_fn(line[1])
-      label = self.process_text_fn(line[2])
-      if label == self.process_text_fn("contradictory"):
-        label = self.process_text_fn("contradiction")
+    for i, line in enumerate(lines):
+      # Only the test set has a header.
+      if set_type == "test" and i == 0:
+        continue
+      guid = "%s-%s" % (set_type, i)
+      if set_type == "test":
+        text_a = self.process_text_fn(line[1])
+        label = "0"
+      else:
+        text_a = self.process_text_fn(line[3])
+        label = self.process_text_fn(line[1])
       examples.append(
-          InputExample(guid=guid, text_a=text_a, text_b=text_b, label=label))
+          InputExample(guid=guid, text_a=text_a, text_b=None, label=label))
     return examples
 
+
+class MnliProcessor(DataProcessor):
+  """Processor for the MultiNLI data set (GLUE version)."""
+
+  def __init__(self,
+               mnli_type="matched",
+               process_text_fn=tokenization.convert_to_unicode):
+    super(MnliProcessor, self).__init__(process_text_fn)
+    if mnli_type not in ("matched", "mismatched"):
+      raise ValueError("Invalid `mnli_type`: %s" % mnli_type)
+    self.mnli_type = mnli_type
+
+  def get_train_examples(self, data_dir):
+    """See base class."""
+    return self._create_examples(
+        self._read_tsv(os.path.join(data_dir, "train.tsv")), "train")
+
   def get_dev_examples(self, data_dir):
     """See base class."""
-    lines = self._read_tsv(os.path.join(data_dir, "xnli.dev.tsv"))
+    if self.mnli_type == "matched":
+      return self._create_examples(
+          self._read_tsv(os.path.join(data_dir, "dev_matched.tsv")),
+          "dev_matched")
+    else:
+      return self._create_examples(
+          self._read_tsv(os.path.join(data_dir, "dev_mismatched.tsv")),
+          "dev_mismatched")
+
+  def get_test_examples(self, data_dir):
+    """See base class."""
+    if self.mnli_type == "matched":
+      return self._create_examples(
+          self._read_tsv(os.path.join(data_dir, "test_matched.tsv")), "test")
+    else:
+      return self._create_examples(
+          self._read_tsv(os.path.join(data_dir, "test_mismatched.tsv")), "test")
+
+  def get_labels(self):
+    """See base class."""
+    return ["contradiction", "entailment", "neutral"]
+
+  @staticmethod
+  def get_processor_name():
+    """See base class."""
+    return "MNLI"
+
+  def _create_examples(self, lines, set_type):
+    """Creates examples for the training/dev/test sets."""
     examples = []
-    for (i, line) in enumerate(lines):
+    for i, line in enumerate(lines):
       if i == 0:
         continue
-      guid = "dev-%d" % i
-      text_a = self.process_text_fn(line[6])
-      text_b = self.process_text_fn(line[7])
-      label = self.process_text_fn(line[1])
+      guid = "%s-%s" % (set_type, self.process_text_fn(line[0]))
+      text_a = self.process_text_fn(line[8])
+      text_b = self.process_text_fn(line[9])
+      if set_type == "test":
+        label = "contradiction"
+      else:
+        label = self.process_text_fn(line[-1])
       examples.append(
           InputExample(guid=guid, text_a=text_a, text_b=text_b, label=label))
     return examples
 
+
+class MrpcProcessor(DataProcessor):
+  """Processor for the MRPC data set (GLUE version)."""
+
+  def get_train_examples(self, data_dir):
+    """See base class."""
+    return self._create_examples(
+        self._read_tsv(os.path.join(data_dir, "train.tsv")), "train")
+
+  def get_dev_examples(self, data_dir):
+    """See base class."""
+    return self._create_examples(
+        self._read_tsv(os.path.join(data_dir, "dev.tsv")), "dev")
+
   def get_test_examples(self, data_dir):
     """See base class."""
-    lines = self._read_tsv(os.path.join(data_dir, "xnli.test.tsv"))
-    examples_by_lang = {k: [] for k in XnliProcessor.supported_languages}
-    for (i, line) in enumerate(lines):
-      if i == 0:
-        continue
-      guid = "test-%d" % i
-      language = self.process_text_fn(line[0])
-      text_a = self.process_text_fn(line[6])
-      text_b = self.process_text_fn(line[7])
-      label = self.process_text_fn(line[1])
-      examples_by_lang[language].append(
-          InputExample(guid=guid, text_a=text_a, text_b=text_b, label=label))
-    return examples_by_lang
+    return self._create_examples(
+        self._read_tsv(os.path.join(data_dir, "test.tsv")), "test")
 
   def get_labels(self):
     """See base class."""
-    return ["contradiction", "entailment", "neutral"]
+    return ["0", "1"]
 
   @staticmethod
   def get_processor_name():
     """See base class."""
-    return "XNLI"
+    return "MRPC"
+
+  def _create_examples(self, lines, set_type):
+    """Creates examples for the training/dev/test sets."""
+    examples = []
+    for i, line in enumerate(lines):
+      if i == 0:
+        continue
+      guid = "%s-%s" % (set_type, i)
+      text_a = self.process_text_fn(line[3])
+      text_b = self.process_text_fn(line[4])
+      if set_type == "test":
+        label = "0"
+      else:
+        label = self.process_text_fn(line[0])
+      examples.append(
+          InputExample(guid=guid, text_a=text_a, text_b=text_b, label=label))
+    return examples
 
 
 class PawsxProcessor(DataProcessor):
   """Processor for the PAWS-X data set."""
-  supported_languages = [
-      "de", "en", "es", "fr", "ja", "ko", "zh"
-  ]
+  supported_languages = ["de", "en", "es", "fr", "ja", "ko", "zh"]
 
   def __init__(self,
               language="en",
@@ -219,11 +304,10 @@ class PawsxProcessor(DataProcessor):
         train_tsv = "translated_train.tsv"
       # Skips the header.
       lines.extend(
-          self._read_tsv(
-              os.path.join(data_dir, language, train_tsv))[1:])
+          self._read_tsv(os.path.join(data_dir, language, train_tsv))[1:])
 
     examples = []
-    for (i, line) in enumerate(lines):
+    for i, line in enumerate(lines):
      guid = "train-%d" % i
       text_a = self.process_text_fn(line[1])
       text_b = self.process_text_fn(line[2])
@@ -235,13 +319,12 @@ class PawsxProcessor(DataProcessor):
   def get_dev_examples(self, data_dir):
     """See base class."""
     lines = []
-    for language in PawsxProcessor.supported_languages:
-      # Skips the header.
+    for lang in PawsxProcessor.supported_languages:
       lines.extend(
-          self._read_tsv(os.path.join(data_dir, language, "dev_2k.tsv"))[1:])
+          self._read_tsv(os.path.join(data_dir, lang, "dev_2k.tsv"))[1:])
 
     examples = []
-    for (i, line) in enumerate(lines):
+    for i, line in enumerate(lines):
       guid = "dev-%d" % i
       text_a = self.process_text_fn(line[1])
       text_b = self.process_text_fn(line[2])
@@ -252,17 +335,15 @@ class PawsxProcessor(DataProcessor):
 
   def get_test_examples(self, data_dir):
     """See base class."""
-    examples_by_lang = {k: [] for k in PawsxProcessor.supported_languages}
-    for language in PawsxProcessor.supported_languages:
-      lines = self._read_tsv(os.path.join(data_dir, language, "test_2k.tsv"))
-      for (i, line) in enumerate(lines):
-        if i == 0:
-          continue
+    examples_by_lang = {k: [] for k in self.supported_languages}
+    for lang in self.supported_languages:
+      lines = self._read_tsv(os.path.join(data_dir, lang, "test_2k.tsv"))[1:]
+      for i, line in enumerate(lines):
        guid = "test-%d" % i
        text_a = self.process_text_fn(line[1])
        text_b = self.process_text_fn(line[2])
        label = self.process_text_fn(line[3])
-        examples_by_lang[language].append(
+        examples_by_lang[lang].append(
            InputExample(guid=guid, text_a=text_a, text_b=text_b, label=label))
     return examples_by_lang
 
@@ -273,57 +354,11 @@
   @staticmethod
   def get_processor_name():
     """See base class."""
-    return "PAWS-X"
-
-
-class MnliProcessor(DataProcessor):
-  """Processor for the MultiNLI data set (GLUE version)."""
-
-  def get_train_examples(self, data_dir):
-    """See base class."""
-    return self._create_examples(
-        self._read_tsv(os.path.join(data_dir, "train.tsv")), "train")
+    return "XTREME-PAWS-X"
 
-  def get_dev_examples(self, data_dir):
-    """See base class."""
-    return self._create_examples(
-        self._read_tsv(os.path.join(data_dir, "dev_matched.tsv")),
-        "dev_matched")
 
-  def get_test_examples(self, data_dir):
-    """See base class."""
-    return self._create_examples(
-        self._read_tsv(os.path.join(data_dir, "test_matched.tsv")), "test")
-
-  def get_labels(self):
-    """See base class."""
-    return ["contradiction", "entailment", "neutral"]
-
-  @staticmethod
-  def get_processor_name():
-    """See base class."""
-    return "MNLI"
-
-  def _create_examples(self, lines, set_type):
-    """Creates examples for the training and dev sets."""
-    examples = []
-    for (i, line) in enumerate(lines):
-      if i == 0:
-        continue
-      guid = "%s-%s" % (set_type, self.process_text_fn(line[0]))
-      text_a = self.process_text_fn(line[8])
-      text_b = self.process_text_fn(line[9])
-      if set_type == "test":
-        label = "contradiction"
-      else:
-        label = self.process_text_fn(line[-1])
-      examples.append(
-          InputExample(guid=guid, text_a=text_a, text_b=text_b, label=label))
-    return examples
-
-
-class MrpcProcessor(DataProcessor):
-  """Processor for the MRPC data set (GLUE version)."""
+class QnliProcessor(DataProcessor):
+  """Processor for the QNLI data set (GLUE version)."""
 
   def get_train_examples(self, data_dir):
     """See base class."""
@@ -333,7 +368,7 @@ class MrpcProcessor(DataProcessor):
   def get_dev_examples(self, data_dir):
     """See base class."""
     return self._create_examples(
-        self._read_tsv(os.path.join(data_dir, "dev.tsv")), "dev")
+        self._read_tsv(os.path.join(data_dir, "dev.tsv")), "dev_matched")
 
   def get_test_examples(self, data_dir):
     """See base class."""
@@ -342,26 +377,28 @@ class MrpcProcessor(DataProcessor):
 
   def get_labels(self):
     """See base class."""
-    return ["0", "1"]
+    return ["entailment", "not_entailment"]
 
   @staticmethod
   def get_processor_name():
     """See base class."""
-    return "MRPC"
+    return "QNLI"
 
   def _create_examples(self, lines, set_type):
-    """Creates examples for the training and dev sets."""
+    """Creates examples for the training/dev/test sets."""
     examples = []
-    for (i, line) in enumerate(lines):
+    for i, line in enumerate(lines):
       if i == 0:
         continue
-      guid = "%s-%s" % (set_type, i)
-      text_a = self.process_text_fn(line[3])
-      text_b = self.process_text_fn(line[4])
+      guid = "%s-%s" % (set_type, i)
       if set_type == "test":
-        label = "0"
+        text_a = tokenization.convert_to_unicode(line[1])
+        text_b = tokenization.convert_to_unicode(line[2])
+        label = "entailment"
       else:
-        label = self.process_text_fn(line[0])
+        text_a = tokenization.convert_to_unicode(line[1])
+        text_b = tokenization.convert_to_unicode(line[2])
+        label = tokenization.convert_to_unicode(line[-1])
       examples.append(
           InputExample(guid=guid, text_a=text_a, text_b=text_b, label=label))
     return examples
@@ -395,9 +432,9 @@ class QqpProcessor(DataProcessor):
     return "QQP"
 
   def _create_examples(self, lines, set_type):
-    """Creates examples for the training and dev sets."""
+    """Creates examples for the training/dev/test sets."""
    examples = []
-    for (i, line) in enumerate(lines):
+    for i, line in enumerate(lines):
       if i == 0:
         continue
       guid = "%s-%s" % (set_type, line[0])
@@ -407,13 +444,13 @@ class QqpProcessor(DataProcessor):
         label = line[5]
       except IndexError:
         continue
-      examples.append(InputExample(guid=guid, text_a=text_a, text_b=text_b,
-                                   label=label))
+      examples.append(
+          InputExample(guid=guid, text_a=text_a, text_b=text_b, label=label))
     return examples
 
 
-class ColaProcessor(DataProcessor):
-  """Processor for the CoLA data set (GLUE version)."""
+class RteProcessor(DataProcessor):
+  """Processor for the RTE data set (GLUE version)."""
 
   def get_train_examples(self, data_dir):
     """See base class."""
@@ -432,29 +469,30 @@ class ColaProcessor(DataProcessor):
 
   def get_labels(self):
     """See base class."""
-    return ["0", "1"]
+    # All datasets are converted to 2-class split, where for 3-class datasets we
+    # collapse neutral and contradiction into not_entailment.
+    return ["entailment", "not_entailment"]
 
   @staticmethod
   def get_processor_name():
     """See base class."""
-    return "COLA"
+    return "RTE"
 
   def _create_examples(self, lines, set_type):
-    """Creates examples for the training and dev sets."""
+    """Creates examples for the training/dev/test sets."""
     examples = []
-    for (i, line) in enumerate(lines):
-      # Only the test set has a header
-      if set_type == "test" and i == 0:
+    for i, line in enumerate(lines):
+      if i == 0:
         continue
       guid = "%s-%s" % (set_type, i)
+      text_a = tokenization.convert_to_unicode(line[1])
+      text_b = tokenization.convert_to_unicode(line[2])
       if set_type == "test":
-        text_a = self.process_text_fn(line[1])
-        label = "0"
+        label = "entailment"
       else:
-        text_a = self.process_text_fn(line[3])
-        label = self.process_text_fn(line[1])
+        label = tokenization.convert_to_unicode(line[3])
       examples.append(
-          InputExample(guid=guid, text_a=text_a, text_b=None, label=label))
+          InputExample(guid=guid, text_a=text_a, text_b=text_b, label=label))
     return examples
 
@@ -486,9 +524,9 @@ class SstProcessor(DataProcessor):
     return "SST-2"
 
   def _create_examples(self, lines, set_type):
-    """Creates examples for the training and dev sets."""
+    """Creates examples for the training/dev/test sets."""
     examples = []
-    for (i, line) in enumerate(lines):
+    for i, line in enumerate(lines):
       if i == 0:
         continue
       guid = "%s-%s" % (set_type, i)
@@ -503,8 +541,14 @@ class SstProcessor(DataProcessor):
     return examples
 
 
-class QnliProcessor(DataProcessor):
-  """Processor for the QNLI data set (GLUE version)."""
+class StsBProcessor(DataProcessor):
+  """Processor for the STS-B data set (GLUE version)."""
+
+  def __init__(self, process_text_fn=tokenization.convert_to_unicode):
+    super(StsBProcessor, self).__init__(process_text_fn=process_text_fn)
+    self.is_regression = True
+    self.label_type = float
+    self._labels = None
 
   def get_train_examples(self, data_dir):
     """See base class."""
@@ -514,7 +558,7 @@ def get_dev_examples(self, data_dir):
     """See base class."""
     return self._create_examples(
-        self._read_tsv(os.path.join(data_dir, "dev.tsv")), "dev_matched")
+        self._read_tsv(os.path.join(data_dir, "dev.tsv")), "dev")
 
   def get_test_examples(self, data_dir):
     """See base class."""
@@ -523,28 +567,26 @@ def get_labels(self):
     """See base class."""
-    return ["entailment", "not_entailment"]
+    return self._labels
 
   @staticmethod
   def get_processor_name():
     """See base class."""
-    return "QNLI"
+    return "STS-B"
 
   def _create_examples(self, lines, set_type):
-    """Creates examples for the training and dev sets."""
+    """Creates examples for the training/dev/test sets."""
     examples = []
-    for (i, line) in enumerate(lines):
+    for i, line in enumerate(lines):
       if i == 0:
         continue
-      guid = "%s-%s" % (set_type, 1)
+      guid = "%s-%s" % (set_type, i)
+      text_a = tokenization.convert_to_unicode(line[7])
+      text_b = tokenization.convert_to_unicode(line[8])
       if set_type == "test":
-        text_a = tokenization.convert_to_unicode(line[1])
-        text_b = tokenization.convert_to_unicode(line[2])
-        label = "entailment"
+        label = 0.0
       else:
-        text_a = tokenization.convert_to_unicode(line[1])
-        text_b = tokenization.convert_to_unicode(line[2])
-        label = tokenization.convert_to_unicode(line[-1])
+        label = self.label_type(tokenization.convert_to_unicode(line[9]))
       examples.append(
           InputExample(guid=guid, text_a=text_a, text_b=text_b, label=label))
     return examples
@@ -564,6 +606,8 @@ class TfdsProcessor(DataProcessor):
     tfds_params="dataset=glue/mrpc,text_key=sentence1,text_b_key=sentence2"
     tfds_params="dataset=glue/stsb,text_key=sentence1,text_b_key=sentence2,"
                 "is_regression=true,label_type=float"
+    tfds_params="dataset=snli,text_key=premise,text_b_key=hypothesis,"
+                "skip_label=-1"
   Possible parameters (please refer to the documentation of Tensorflow
   Datasets (TFDS) for the meaning of individual parameters):
     dataset: Required dataset name (potentially with subset and version number).
@@ -581,17 +625,19 @@ class TfdsProcessor(DataProcessor):
     label_type: Type of the label key (defaults to `int`).
     weight_key: Key of the float sample weight (is not used if not provided).
     is_regression: Whether the task is a regression problem (defaults to False).
+    skip_label: Skip examples with given label (defaults to None).
   """
 
-  def __init__(self, tfds_params,
+  def __init__(self,
+               tfds_params,
                process_text_fn=tokenization.convert_to_unicode):
     super(TfdsProcessor, self).__init__(process_text_fn)
     self._process_tfds_params_str(tfds_params)
     if self.module_import:
       importlib.import_module(self.module_import)
 
-    self.dataset, info = tfds.load(self.dataset_name, data_dir=self.data_dir,
-                                   with_info=True)
+    self.dataset, info = tfds.load(
+        self.dataset_name, data_dir=self.data_dir, with_info=True)
     if self.is_regression:
       self._labels = None
     else:
@@ -619,6 +665,9 @@ class TfdsProcessor(DataProcessor):
     self.label_type = dtype_map[d.get("label_type", "int")]
     self.is_regression = cast_str_to_bool(d.get("is_regression", "False"))
     self.weight_key = d.get("weight_key", None)
+    self.skip_label = d.get("skip_label", None)
+    if self.skip_label is not None:
+      self.skip_label = self.label_type(self.skip_label)
 
   def get_train_examples(self, data_dir):
     assert data_dir is None
@@ -639,7 +688,7 @@
     return "TFDS_" + self.dataset_name
 
   def _create_examples(self, split_name, set_type):
-    """Creates examples for the training and dev sets."""
+    """Creates examples for the training/dev/test sets."""
     if split_name not in self.dataset:
       raise ValueError("Split {} not available.".format(split_name))
     dataset = self.dataset[split_name].as_numpy_iterator()
@@ -657,13 +706,258 @@
       if self.text_b_key:
         text_b = self.process_text_fn(example[self.text_b_key])
       label = self.label_type(example[self.label_key])
+      if self.skip_label is not None and label == self.skip_label:
+        continue
       if self.weight_key:
         weight = float(example[self.weight_key])
       examples.append(
-          InputExample(guid=guid, text_a=text_a, text_b=text_b, label=label,
-                       weight=weight))
+          InputExample(
+              guid=guid,
+              text_a=text_a,
+              text_b=text_b,
+              label=label,
+              weight=weight))
+    return examples
+
+
+class WnliProcessor(DataProcessor):
+  """Processor for the WNLI data set (GLUE version)."""
+
+  def get_train_examples(self, data_dir):
+    """See base class."""
+    return self._create_examples(
+        self._read_tsv(os.path.join(data_dir, "train.tsv")), "train")
+
+  def get_dev_examples(self, data_dir):
+    """See base class."""
+    return self._create_examples(
+        self._read_tsv(os.path.join(data_dir, "dev.tsv")), "dev")
+
+  def get_test_examples(self, data_dir):
+    """See base class."""
+    return self._create_examples(
+        self._read_tsv(os.path.join(data_dir, "test.tsv")), "test")
+
+  def get_labels(self):
+    """See base class."""
+    return ["0", "1"]
+
+  @staticmethod
+  def get_processor_name():
+    """See base class."""
+    return "WNLI"
+
+  def _create_examples(self, lines, set_type):
+    """Creates examples for the training/dev/test sets."""
+    examples = []
+    for i, line in enumerate(lines):
+      if i == 0:
+        continue
+      guid = "%s-%s" % (set_type, i)
+      text_a = tokenization.convert_to_unicode(line[1])
+      text_b = tokenization.convert_to_unicode(line[2])
+      if set_type == "test":
+        label = "0"
+      else:
+        label = tokenization.convert_to_unicode(line[3])
+      examples.append(
+          InputExample(guid=guid, text_a=text_a, text_b=text_b, label=label))
+    return examples
+
+
+class XnliProcessor(DataProcessor):
+  """Processor for the XNLI data set."""
+  supported_languages = [
+      "ar", "bg", "de", "el", "en", "es", "fr", "hi", "ru", "sw", "th", "tr",
+      "ur", "vi", "zh"
+  ]
+
+  def __init__(self,
+               language="en",
+               process_text_fn=tokenization.convert_to_unicode):
+    super(XnliProcessor, self).__init__(process_text_fn)
+    if language == "all":
+      self.languages = XnliProcessor.supported_languages
+    elif language not in XnliProcessor.supported_languages:
+      raise ValueError("language %s is not supported for XNLI task." % language)
+    else:
+      self.languages = [language]
+
+  def get_train_examples(self, data_dir):
+    """See base class."""
+    lines = []
+    for language in self.languages:
+      # Skips the header.
+      lines.extend(
+          self._read_tsv(
+              os.path.join(data_dir, "multinli",
+                           "multinli.train.%s.tsv" % language))[1:])
+
+    examples = []
+    for i, line in enumerate(lines):
+      guid = "train-%d" % i
+      text_a = self.process_text_fn(line[0])
+      text_b = self.process_text_fn(line[1])
+      label = self.process_text_fn(line[2])
+      if label == self.process_text_fn("contradictory"):
+        label = self.process_text_fn("contradiction")
+      examples.append(
+          InputExample(guid=guid, text_a=text_a, text_b=text_b, label=label))
+    return examples
+
+  def get_dev_examples(self, data_dir):
+    """See base class."""
+    lines = self._read_tsv(os.path.join(data_dir, "xnli.dev.tsv"))
+    examples = []
+    for i, line in enumerate(lines):
+      if i == 0:
+        continue
+      guid = "dev-%d" % i
+      text_a = self.process_text_fn(line[6])
+      text_b = self.process_text_fn(line[7])
+      label = self.process_text_fn(line[1])
+      examples.append(
+          InputExample(guid=guid, text_a=text_a, text_b=text_b, label=label))
+    return examples
+
+  def get_test_examples(self, data_dir):
+    """See base class."""
+    lines = self._read_tsv(os.path.join(data_dir, "xnli.test.tsv"))
+    examples_by_lang = {k: [] for k in XnliProcessor.supported_languages}
+    for i, line in enumerate(lines):
+      if i == 0:
+        continue
+      guid = "test-%d" % i
+      language = self.process_text_fn(line[0])
+      text_a = self.process_text_fn(line[6])
+      text_b = self.process_text_fn(line[7])
+      label = self.process_text_fn(line[1])
+      examples_by_lang[language].append(
+          InputExample(guid=guid, text_a=text_a, text_b=text_b, label=label))
+    return examples_by_lang
+
+  def get_labels(self):
+    """See base class."""
+    return ["contradiction", "entailment", "neutral"]
+
+  @staticmethod
+  def get_processor_name():
+    """See base class."""
+    return "XNLI"
+
+
+class XtremePawsxProcessor(DataProcessor):
+  """Processor for the XTREME PAWS-X data set."""
+  supported_languages = ["de", "en", "es", "fr", "ja", "ko", "zh"]
+
+  def get_train_examples(self, data_dir):
+    """See base class."""
+    lines = self._read_tsv(os.path.join(data_dir, "train-en.tsv"))
+    examples = []
+    for i, line in enumerate(lines):
+      guid = "train-%d" % i
+      text_a = self.process_text_fn(line[0])
+      text_b = self.process_text_fn(line[1])
+      label = self.process_text_fn(line[2])
+      examples.append(
+          InputExample(guid=guid, text_a=text_a, text_b=text_b, label=label))
+    return examples
+
+  def get_dev_examples(self, data_dir):
+    """See base class."""
+    lines = self._read_tsv(os.path.join(data_dir, "dev-en.tsv"))
+
+    examples = []
+    for i, line in enumerate(lines):
+      guid = "dev-%d" % i
+      text_a = self.process_text_fn(line[0])
+      text_b = self.process_text_fn(line[1])
+      label = self.process_text_fn(line[2])
+      examples.append(
+          InputExample(guid=guid, text_a=text_a, text_b=text_b, label=label))
+    return examples
+
+  def get_test_examples(self, data_dir):
+    """See base class."""
+    examples_by_lang = {k: [] for k in self.supported_languages}
+    for lang in self.supported_languages:
+      lines = self._read_tsv(os.path.join(data_dir, f"test-{lang}.tsv"))
+      for i, line in enumerate(lines):
+        guid = "test-%d" % i
+        text_a = self.process_text_fn(line[0])
+        text_b = self.process_text_fn(line[1])
+        label = "0"
+        examples_by_lang[lang].append(
+            InputExample(guid=guid, text_a=text_a, text_b=text_b, label=label))
+    return examples_by_lang
+
+  def get_labels(self):
+    """See base class."""
+    return ["0", "1"]
+
+  @staticmethod
+  def get_processor_name():
+    """See base class."""
+    return "XTREME-PAWS-X"
+
+
+class XtremeXnliProcessor(DataProcessor):
+  """Processor for the XTREME XNLI data set."""
+  supported_languages = [
+      "ar", "bg", "de", "el", "en", "es", "fr", "hi", "ru", "sw", "th", "tr",
+      "ur", "vi", "zh"
+  ]
+
+  def get_train_examples(self, data_dir):
+    """See base class."""
+    lines = self._read_tsv(os.path.join(data_dir, "train-en.tsv"))
+
+    examples = []
+    for i, line in enumerate(lines):
+      guid = "train-%d" % i
+      text_a = self.process_text_fn(line[0])
+      text_b = self.process_text_fn(line[1])
+      label = self.process_text_fn(line[2])
+      examples.append(
+          InputExample(guid=guid, text_a=text_a, text_b=text_b, label=label))
+    return examples
+
+  def get_dev_examples(self, data_dir):
+    """See base class."""
+    lines = self._read_tsv(os.path.join(data_dir, "dev-en.tsv"))
+    examples = []
+    for i, line in enumerate(lines):
+      guid = "dev-%d" % i
+      text_a = self.process_text_fn(line[0])
+      text_b = self.process_text_fn(line[1])
+      label = self.process_text_fn(line[2])
+      examples.append(
+          InputExample(guid=guid, text_a=text_a, text_b=text_b, label=label))
+    return examples
+
+  def get_test_examples(self, data_dir):
+    """See base class."""
+    examples_by_lang = {k: [] for k in self.supported_languages}
+    for lang in self.supported_languages:
+      lines = self._read_tsv(os.path.join(data_dir, f"test-{lang}.tsv"))
+      for i, line in enumerate(lines):
+        guid = f"test-{i}"
+        text_a = self.process_text_fn(line[0])
+        text_b = self.process_text_fn(line[1])
+        label = "contradiction"
+        examples_by_lang[lang].append(
+            InputExample(guid=guid, text_a=text_a, text_b=text_b, label=label))
+    return examples_by_lang
+
+  def get_labels(self):
+    """See base class."""
+    return ["contradiction", "entailment", "neutral"]
+
+  @staticmethod
+  def get_processor_name():
+    """See base class."""
+    return "XTREME-XNLI"
+
 
 def convert_single_example(ex_index, example, label_list, max_seq_length,
                            tokenizer):
@@ -748,8 +1042,9 @@ def convert_single_example(ex_index, example, label_list, max_seq_length,
     logging.info("input_ids: %s", " ".join([str(x) for x in input_ids]))
     logging.info("input_mask: %s", " ".join([str(x) for x in input_mask]))
     logging.info("segment_ids: %s", " ".join([str(x) for x in segment_ids]))
-    logging.info("label: %s (id = %d)", example.label, label_id)
+    logging.info("label: %s (id = %s)", example.label, str(label_id))
     logging.info("weight: %s", example.weight)
+    logging.info("int_iden: %s", str(example.int_iden))
 
   feature = InputFeatures(
@@ -757,19 +1052,24 @@
       segment_ids=segment_ids,
       label_id=label_id,
       is_real_example=True,
-      weight=example.weight)
+      weight=example.weight,
+      int_iden=example.int_iden)
+
   return feature
 
 
-def file_based_convert_examples_to_features(examples, label_list,
-                                            max_seq_length, tokenizer,
-                                            output_file, label_type=None):
+def file_based_convert_examples_to_features(examples,
+                                            label_list,
+                                            max_seq_length,
+                                            tokenizer,
+                                            output_file,
+                                            label_type=None):
   """Convert a set of `InputExample`s to a TFRecord file."""
 
   tf.io.gfile.makedirs(os.path.dirname(output_file))
   writer = tf.io.TFRecordWriter(output_file)
 
-  for (ex_index, example) in enumerate(examples):
+  for ex_index, example in enumerate(examples):
     if ex_index % 10000 == 0:
       logging.info("Writing example %d of %d", ex_index, len(examples))
 
@@ -779,6 +1079,7 @@
     def create_int_feature(values):
       f = tf.train.Feature(int64_list=tf.train.Int64List(value=list(values)))
       return f
+
     def create_float_feature(values):
       f = tf.train.Feature(float_list=tf.train.FloatList(value=list(values)))
       return f
@@ -789,12 +1090,14 @@
     features["segment_ids"] = create_int_feature(feature.segment_ids)
     if label_type is not None and label_type == float:
      features["label_ids"] = create_float_feature([feature.label_id])
-    else:
+    elif feature.label_id is not None:
      features["label_ids"] = create_int_feature([feature.label_id])
     features["is_real_example"] = create_int_feature(
         [int(feature.is_real_example)])
     if feature.weight is not None:
       features["weight"] = create_float_feature([feature.weight])
+    if feature.int_iden is not None:
+      features["int_iden"] = create_int_feature([feature.int_iden])
 
     tf_example = tf.train.Example(features=tf.train.Features(feature=features))
     writer.write(tf_example.SerializeToString())
@@ -830,8 +1133,7 @@
   Arguments:
     processor: Input processor object to be used for generating data. Subclass
       of `DataProcessor`.
-    data_dir: Directory that contains train/eval data to process. Data files
-      should be in from "dev.tsv", "test.tsv", or "train.tsv".
+    data_dir: Directory that contains train/eval/test data to process.
     tokenizer: The tokenizer to be applied on the data.
     train_data_output_path: Output to which processed tf record for training
       will be saved.
@@ -857,8 +1159,7 @@
   train_input_data_examples = processor.get_train_examples(data_dir)
   file_based_convert_examples_to_features(train_input_data_examples,
                                           label_list, max_seq_length, tokenizer,
-                                          train_data_output_path,
-                                          label_type)
+                                          train_data_output_path, label_type)
   num_training_data = len(train_input_data_examples)
 
   if eval_data_output_path:
@@ -868,26 +1169,27 @@
                                             tokenizer, eval_data_output_path,
                                             label_type)
 
+  meta_data = {
+      "processor_type": processor.get_processor_name(),
+      "train_data_size": num_training_data,
+      "max_seq_length": max_seq_length,
+  }
+
   if test_data_output_path:
     test_input_data_examples = processor.get_test_examples(data_dir)
     if isinstance(test_input_data_examples, dict):
       for language, examples in test_input_data_examples.items():
         file_based_convert_examples_to_features(
-            examples,
-            label_list, max_seq_length,
-            tokenizer, test_data_output_path.format(language),
-            label_type)
+            examples, label_list, max_seq_length, tokenizer,
+            test_data_output_path.format(language), label_type)
+        meta_data["test_{}_data_size".format(language)] = len(examples)
    else:
       file_based_convert_examples_to_features(test_input_data_examples,
                                               label_list, max_seq_length,
                                               tokenizer,
test_data_output_path, label_type) + meta_data["test_data_size"] = len(test_input_data_examples) - meta_data = { - "processor_type": processor.get_processor_name(), - "train_data_size": num_training_data, - "max_seq_length": max_seq_length, - } if is_regression: meta_data["task_type"] = "bert_regression" meta_data["label_type"] = {int: "int", float: "float"}[label_type] @@ -900,12 +1202,4 @@ def generate_tf_record_from_data_file(processor, if eval_data_output_path: meta_data["eval_data_size"] = len(eval_input_data_examples) - if test_data_output_path: - test_input_data_examples = processor.get_test_examples(data_dir) - if isinstance(test_input_data_examples, dict): - for language, examples in test_input_data_examples.items(): - meta_data["test_{}_data_size".format(language)] = len(examples) - else: - meta_data["test_data_size"] = len(test_input_data_examples) - return meta_data diff --git a/official/nlp/data/create_finetuning_data.py b/official/nlp/data/create_finetuning_data.py index 256c1dee0adad8b4e35a58212e62573edd946b6b..403d66b41c5b728cb3da5e3d31eeea535defbc91 100644 --- a/official/nlp/data/create_finetuning_data.py +++ b/official/nlp/data/create_finetuning_data.py @@ -27,18 +27,21 @@ from absl import flags import tensorflow as tf from official.nlp.bert import tokenization from official.nlp.data import classifier_data_lib +from official.nlp.data import sentence_retrieval_lib # word-piece tokenizer based squad_lib from official.nlp.data import squad_lib as squad_lib_wp # sentence-piece tokenizer based squad_lib from official.nlp.data import squad_lib_sp +from official.nlp.data import tagging_data_lib FLAGS = flags.FLAGS +# TODO(chendouble): consider moving each task to its own binary. 
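The processors consumed by `generate_tf_record_from_data_file` above all share one contract: each `get_*_examples` method reads a split-specific TSV and wraps its rows in `InputExample`s (multilingual test sets return a per-language dict). A minimal, framework-free sketch of that contract; `TinyExample` and `TinyXnliProcessor` are illustrative names, not classes from this library:

```python
import csv
import io


class TinyExample:
  """Illustrative stand-in for classifier_data_lib.InputExample."""

  def __init__(self, guid, text_a, text_b=None, label=None):
    self.guid = guid
    self.text_a = text_a
    self.text_b = text_b
    self.label = label


class TinyXnliProcessor:
  """Sketch of the get_*_examples contract shown in the diff above."""

  def _read_tsv(self, fileobj):
    return list(csv.reader(fileobj, delimiter="\t"))

  def get_train_examples(self, fileobj):
    # Same shape as the real processors: guid, text_a, text_b, label per row.
    return [
        TinyExample("train-%d" % i, line[0], line[1], line[2])
        for i, line in enumerate(self._read_tsv(fileobj))
    ]


tsv = io.StringIO("premise\thypothesis\tentailment\n")
examples = TinyXnliProcessor().get_train_examples(tsv)
print(examples[0].guid, examples[0].label)  # train-0 entailment
```

The real processors additionally run `process_text_fn` over every field and read from `data_dir`; this sketch keeps only the guid/text/label plumbing.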
flags.DEFINE_enum( "fine_tuning_task_type", "classification", - ["classification", "regression", "squad"], + ["classification", "regression", "squad", "retrieval", "tagging"], "The name of the BERT fine tuning task for which data " - "will be generated..") + "will be generated.") # BERT classification specific flags. flags.DEFINE_string( @@ -47,23 +50,41 @@ flags.DEFINE_string( "for the task.") flags.DEFINE_enum("classification_task_name", "MNLI", - ["COLA", "MNLI", "MRPC", "QNLI", "QQP", "SST-2", "XNLI", - "PAWS-X"], - "The name of the task to train BERT classifier.") + ["COLA", "MNLI", "MRPC", "PAWS-X", "QNLI", "QQP", "RTE", + "SST-2", "STS-B", "WNLI", "XNLI", "XTREME-XNLI", + "XTREME-PAWS-X"], + "The name of the task to train BERT classifier. The " + "difference between XTREME-XNLI and XNLI is: 1. the format " + "of input tsv files; 2. the dev set for XTREME is english " + "only and for XNLI is all languages combined. Same for " + "PAWS-X.") + +# MNLI task-specific flag. +flags.DEFINE_enum( + "mnli_type", "matched", ["matched", "mismatched"], + "The type of MNLI dataset.") -# XNLI task specific flag. +# XNLI task-specific flag. flags.DEFINE_string( "xnli_language", "en", - "Language of training data for XNIL task. If the value is 'all', the data " + "Language of training data for XNLI task. If the value is 'all', the data " "of all languages will be used for training.") -# PAWS-X task specific flag. +# PAWS-X task-specific flag. flags.DEFINE_string( "pawsx_language", "en", - "Language of trainig data for PAWS-X task. If the value is 'all', the data " + "Language of training data for PAWS-X task. If the value is 'all', the data " "of all languages will be used for training.") -# BERT Squad task specific flags. +# Retrieval task-specific flags. +flags.DEFINE_enum("retrieval_task_name", "bucc", ["bucc", "tatoeba"], + "The name of sentence retrieval task for scoring") + +# Tagging task-specific flags. 
+flags.DEFINE_enum("tagging_task_name", "panx", ["panx", "udpos"], + "The name of BERT tagging (token classification) task.") + +# BERT Squad task-specific flags. flags.DEFINE_string( "squad_data_file", None, "The input data file in for generating training data for BERT squad task.") @@ -163,20 +184,29 @@ def generate_classifier_dataset(): "cola": classifier_data_lib.ColaProcessor, "mnli": - classifier_data_lib.MnliProcessor, + functools.partial(classifier_data_lib.MnliProcessor, + mnli_type=FLAGS.mnli_type), "mrpc": classifier_data_lib.MrpcProcessor, "qnli": classifier_data_lib.QnliProcessor, "qqp": classifier_data_lib.QqpProcessor, + "rte": classifier_data_lib.RteProcessor, "sst-2": classifier_data_lib.SstProcessor, + "sts-b": + classifier_data_lib.StsBProcessor, "xnli": functools.partial(classifier_data_lib.XnliProcessor, language=FLAGS.xnli_language), "paws-x": functools.partial(classifier_data_lib.PawsxProcessor, - language=FLAGS.pawsx_language) + language=FLAGS.pawsx_language), + "wnli": classifier_data_lib.WnliProcessor, + "xtreme-xnli": + functools.partial(classifier_data_lib.XtremeXnliProcessor), + "xtreme-paws-x": + functools.partial(classifier_data_lib.XtremePawsxProcessor) } task_name = FLAGS.classification_task_name.lower() if task_name not in processors: @@ -237,6 +267,67 @@ def generate_squad_dataset(): FLAGS.max_query_length, FLAGS.doc_stride, FLAGS.version_2_with_negative) +def generate_retrieval_dataset(): + """Generate retrieval test and dev dataset and returns input meta data.""" + assert (FLAGS.input_data_dir and FLAGS.retrieval_task_name) + if FLAGS.tokenizer_impl == "word_piece": + tokenizer = tokenization.FullTokenizer( + vocab_file=FLAGS.vocab_file, do_lower_case=FLAGS.do_lower_case) + processor_text_fn = tokenization.convert_to_unicode + else: + assert FLAGS.tokenizer_impl == "sentence_piece" + tokenizer = tokenization.FullSentencePieceTokenizer(FLAGS.sp_model_file) + processor_text_fn = functools.partial( + tokenization.preprocess_text, 
lower=FLAGS.do_lower_case) + + processors = { + "bucc": sentence_retrieval_lib.BuccProcessor, + "tatoeba": sentence_retrieval_lib.TatoebaProcessor, + } + + task_name = FLAGS.retrieval_task_name.lower() + if task_name not in processors: + raise ValueError("Task not found: %s" % task_name) + + processor = processors[task_name](process_text_fn=processor_text_fn) + + return sentence_retrieval_lib.generate_sentence_retrevial_tf_record( + processor, + FLAGS.input_data_dir, + tokenizer, + FLAGS.eval_data_output_path, + FLAGS.test_data_output_path, + FLAGS.max_seq_length) + + +def generate_tagging_dataset(): + """Generates tagging dataset.""" + processors = { + "panx": tagging_data_lib.PanxProcessor, + "udpos": tagging_data_lib.UdposProcessor, + } + task_name = FLAGS.tagging_task_name.lower() + if task_name not in processors: + raise ValueError("Task not found: %s" % task_name) + + if FLAGS.tokenizer_impl == "word_piece": + tokenizer = tokenization.FullTokenizer( + vocab_file=FLAGS.vocab_file, do_lower_case=FLAGS.do_lower_case) + processor_text_fn = tokenization.convert_to_unicode + elif FLAGS.tokenizer_impl == "sentence_piece": + tokenizer = tokenization.FullSentencePieceTokenizer(FLAGS.sp_model_file) + processor_text_fn = functools.partial( + tokenization.preprocess_text, lower=FLAGS.do_lower_case) + else: + raise ValueError("Unsupported tokenizer_impl: %s" % FLAGS.tokenizer_impl) + + processor = processors[task_name]() + return tagging_data_lib.generate_tf_record_from_data_file( + processor, FLAGS.input_data_dir, tokenizer, FLAGS.max_seq_length, + FLAGS.train_data_output_path, FLAGS.eval_data_output_path, + FLAGS.test_data_output_path, processor_text_fn) + + def main(_): if FLAGS.tokenizer_impl == "word_piece": if not FLAGS.vocab_file: @@ -248,12 +339,20 @@ def main(_): raise ValueError( "FLAG sp_model_file for sentence-piece tokenizer is not specified.") + if FLAGS.fine_tuning_task_type != "retrieval": + flags.mark_flag_as_required("train_data_output_path") + if 
FLAGS.fine_tuning_task_type == "classification": input_meta_data = generate_classifier_dataset() elif FLAGS.fine_tuning_task_type == "regression": input_meta_data = generate_regression_dataset() - else: + elif FLAGS.fine_tuning_task_type == "retrieval": + input_meta_data = generate_retrieval_dataset() + elif FLAGS.fine_tuning_task_type == "squad": input_meta_data = generate_squad_dataset() + else: + assert FLAGS.fine_tuning_task_type == "tagging" + input_meta_data = generate_tagging_dataset() tf.io.gfile.makedirs(os.path.dirname(FLAGS.meta_data_file_path)) with tf.io.gfile.GFile(FLAGS.meta_data_file_path, "w") as writer: @@ -261,6 +360,5 @@ def main(_): if __name__ == "__main__": - flags.mark_flag_as_required("train_data_output_path") flags.mark_flag_as_required("meta_data_file_path") app.run(main) diff --git a/official/nlp/data/create_pretraining_data.py b/official/nlp/data/create_pretraining_data.py index 79dac57ac8775687673604af6fb2fb50c9f74244..fff6391cee95d209be8f785fd43dd73184a65d11 100644 --- a/official/nlp/data/create_pretraining_data.py +++ b/official/nlp/data/create_pretraining_data.py @@ -18,6 +18,7 @@ from __future__ import division from __future__ import print_function import collections +import itertools import random from absl import app @@ -48,6 +49,12 @@ flags.DEFINE_bool( "do_whole_word_mask", False, "Whether to use whole word masking rather than per-WordPiece masking.") +flags.DEFINE_integer( + "max_ngram_size", None, + "Mask contiguous whole words (n-grams) of up to `max_ngram_size` using a " + "weighting scheme to favor shorter n-grams. 
" + "Note: `--do_whole_word_mask=True` must also be set when n-gram masking.") + flags.DEFINE_bool( "gzip_compress", False, "Whether to use `GZIP` compress option to get compressed TFRecord files.") @@ -192,7 +199,8 @@ def create_training_instances(input_files, masked_lm_prob, max_predictions_per_seq, rng, - do_whole_word_mask=False): + do_whole_word_mask=False, + max_ngram_size=None): """Create `TrainingInstance`s from raw text.""" all_documents = [[]] @@ -229,7 +237,7 @@ def create_training_instances(input_files, create_instances_from_document( all_documents, document_index, max_seq_length, short_seq_prob, masked_lm_prob, max_predictions_per_seq, vocab_words, rng, - do_whole_word_mask)) + do_whole_word_mask, max_ngram_size)) rng.shuffle(instances) return instances @@ -238,7 +246,8 @@ def create_training_instances(input_files, def create_instances_from_document( all_documents, document_index, max_seq_length, short_seq_prob, masked_lm_prob, max_predictions_per_seq, vocab_words, rng, - do_whole_word_mask=False): + do_whole_word_mask=False, + max_ngram_size=None): """Creates `TrainingInstance`s for a single document.""" document = all_documents[document_index] @@ -337,7 +346,7 @@ def create_instances_from_document( (tokens, masked_lm_positions, masked_lm_labels) = create_masked_lm_predictions( tokens, masked_lm_prob, max_predictions_per_seq, vocab_words, rng, - do_whole_word_mask) + do_whole_word_mask, max_ngram_size) instance = TrainingInstance( tokens=tokens, segment_ids=segment_ids, @@ -355,72 +364,238 @@ def create_instances_from_document( MaskedLmInstance = collections.namedtuple("MaskedLmInstance", ["index", "label"]) +# A _Gram is a [half-open) interval of token indices which form a word. +# E.g., +# words: ["The", "doghouse"] +# tokens: ["The", "dog", "##house"] +# grams: [(0,1), (1,3)] +_Gram = collections.namedtuple("_Gram", ["begin", "end"]) + + +def _window(iterable, size): + """Helper to create a sliding window iterator with a given size. 
+
+  E.g.,
+    input = [1, 2, 3, 4]
+    _window(input, 1) => [1], [2], [3], [4]
+    _window(input, 2) => [1, 2], [2, 3], [3, 4]
+    _window(input, 3) => [1, 2, 3], [2, 3, 4]
+    _window(input, 4) => [1, 2, 3, 4]
+    _window(input, 5) => (nothing yielded: input is shorter than the window)
+
+  Arguments:
+    iterable: elements to iterate over.
+    size: size of the window.
+
+  Yields:
+    Elements of `iterable` batched into a sliding window of length `size`.
+  """
+  i = iter(iterable)
+  window = []
+  try:
+    for e in range(0, size):
+      window.append(next(i))
+    yield window
+  except StopIteration:
+    # Handle the case where iterable's length is less than the window size.
+    return
+  for e in i:
+    window = window[1:] + [e]
+    yield window
+
+
+def _contiguous(sorted_grams):
+  """Test whether a sequence of grams is contiguous.
+
+  Arguments:
+    sorted_grams: _Grams which are sorted in increasing order.
+
+  Returns:
+    True if `sorted_grams` are touching each other.
+
+  E.g.,
+    _contiguous([(1, 4), (4, 5), (5, 10)]) == True
+    _contiguous([(1, 2), (4, 5)]) == False
+  """
+  for a, b in _window(sorted_grams, 2):
+    if a.end != b.begin:
+      return False
+  return True
+
+
+def _masking_ngrams(grams, max_ngram_size, max_masked_tokens, rng):
+  """Create a list of masking {1, ..., n}-grams from a list of one-grams.
+
+  This is an extension of 'whole word masking' to mask multiple, contiguous
+  words (e.g., "the red boat").
+
+  Each input gram represents the token indices of a single word,
+    words:  ["the", "red", "boat"]
+    tokens: ["the", "red", "boa", "##t"]
+    grams:  [(0,1), (1,2), (2,4)]
+
+  For a `max_ngram_size` of three, possible output masks include:
+    1-grams: (0,1), (1,2), (2,4)
+    2-grams: (0,2), (1,4)
+    3-grams: (0,4)
+
+  Output masks will not overlap and contain no more than `max_masked_tokens`
+  total tokens. E.g., for the example above with `max_masked_tokens` as three,
+  valid outputs are:
+    [(0,1), (1,2)]  # "the", "red" covering two tokens
+    [(1,2), (2,4)]  # "red", "boa", "##t" covering three tokens
+
+  The length of the selected n-gram follows a Zipf weighting to
+  favor shorter n-gram sizes (weight(1)=1, weight(2)=1/2, weight(3)=1/3, ...).
+
+  Arguments:
+    grams: List of one-grams.
+    max_ngram_size: Maximum number of contiguous one-grams combined to create
+      an n-gram.
+    max_masked_tokens: Maximum total number of tokens to be masked.
+    rng: `random.Random` generator.
+
+  Returns:
+    A list of n-grams to be used as masks.
+  """
+  if not grams:
+    return None
+
+  grams = sorted(grams)
+  num_tokens = grams[-1].end
+
+  # Ensure our grams are valid (i.e., they don't overlap).
+  for a, b in _window(grams, 2):
+    if a.end > b.begin:
+      raise ValueError("overlapping grams: {}".format(grams))
+
+  # Build map from n-gram length to list of n-grams.
+  ngrams = {i: [] for i in range(1, max_ngram_size+1)}
+  for gram_size in range(1, max_ngram_size+1):
+    for g in _window(grams, gram_size):
+      if _contiguous(g):
+        # Add an n-gram which spans these one-grams.
+        ngrams[gram_size].append(_Gram(g[0].begin, g[-1].end))
+
+  # Shuffle each list of n-grams.
+  for v in ngrams.values():
+    rng.shuffle(v)
+
+  # Create the weighting for n-gram length selection.
+  # Stored cumulatively for `random.choices` below.
+  cummulative_weights = list(
+      itertools.accumulate([1./n for n in range(1, max_ngram_size+1)]))
+
+  output_ngrams = []
+  # Keep a bitmask of which tokens have been masked.
+  masked_tokens = [False] * num_tokens
+  # Loop until we have enough masked tokens or there are no more candidate
+  # n-grams of any length.
+  # Each code path should ensure one or more elements from `ngrams` are removed
+  # to guarantee this loop terminates.
+  while (sum(masked_tokens) < max_masked_tokens and
+         sum(len(s) for s in ngrams.values())):
+    # Pick an n-gram size based on our weights.
+    # Use the seeded `rng` (not the module-level `random`) so results stay
+    # reproducible.
+    sz = rng.choices(range(1, max_ngram_size+1),
+                     cum_weights=cummulative_weights)[0]
+
+    # Ensure this size doesn't result in too many masked tokens.
+    # E.g., a two-gram contains _at least_ two tokens.
+    if sum(masked_tokens) + sz > max_masked_tokens:
+      # All n-grams of this length are too long and can be removed from
+      # consideration.
+      ngrams[sz].clear()
+      continue
 
-def create_masked_lm_predictions(tokens, masked_lm_prob,
-                                 max_predictions_per_seq, vocab_words, rng,
-                                 do_whole_word_mask):
-  """Creates the predictions for the masked LM objective."""
+    # All of the n-grams of this size have been used.
+    if not ngrams[sz]:
+      continue
+
+    # Choose a random n-gram of the given size.
+    gram = ngrams[sz].pop()
+    num_gram_tokens = gram.end-gram.begin
+
+    # Check if this would add too many tokens.
+    if num_gram_tokens + sum(masked_tokens) > max_masked_tokens:
+      continue
+
+    # Check if any of the tokens in this gram have already been masked.
+    if sum(masked_tokens[gram.begin:gram.end]):
+      continue
 
-  cand_indexes = []
-  for (i, token) in enumerate(tokens):
-    if token == "[CLS]" or token == "[SEP]":
+    # Found a usable n-gram! Mark its tokens as masked and add it to return.
+    masked_tokens[gram.begin:gram.end] = [True] * (gram.end-gram.begin)
+    output_ngrams.append(gram)
+  return output_ngrams
+
+
+def _wordpieces_to_grams(tokens):
+  """Reconstitute grams (words) from `tokens`.
+
+  E.g.,
+    tokens: ['[CLS]', 'That', 'lit', '##tle', 'blue', 'tru', '##ck', '[SEP]']
+    grams:  [          [1,2),  [2,        4),  [4,5),  [5,       6)]
+
+  Arguments:
+    tokens: list of wordpieces
+  Returns:
+    List of _Grams representing spans of whole words
+    (without "[CLS]" and "[SEP]").
+  """
+  grams = []
+  gram_start_pos = None
+  for i, token in enumerate(tokens):
+    if gram_start_pos is not None and token.startswith("##"):
+      continue
-    # Whole Word Masking means that if we mask all of the wordpieces
-    # corresponding to an original word. When a word has been split into
-    # WordPieces, the first token does not have any marker and any subsequence
-    # tokens are prefixed with ##. So whenever we see the ## token, we
-    # append it to the previous set of word indexes.
-    #
-    # Note that Whole Word Masking does *not* change the training code
-    # at all -- we still predict each WordPiece independently, softmaxed
-    # over the entire vocabulary.
-    if (do_whole_word_mask and len(cand_indexes) >= 1 and
-        token.startswith("##")):
-      cand_indexes[-1].append(i)
+    if gram_start_pos is not None:
+      grams.append(_Gram(gram_start_pos, i))
+    if token not in ["[CLS]", "[SEP]"]:
+      gram_start_pos = i
     else:
-      cand_indexes.append([i])
+      gram_start_pos = None
+  if gram_start_pos is not None:
+    grams.append(_Gram(gram_start_pos, len(tokens)))
+  return grams
 
-  rng.shuffle(cand_indexes)
 
-  output_tokens = list(tokens)
+def create_masked_lm_predictions(tokens, masked_lm_prob,
+                                 max_predictions_per_seq, vocab_words, rng,
+                                 do_whole_word_mask,
+                                 max_ngram_size=None):
+  """Creates the predictions for the masked LM objective."""
+  if do_whole_word_mask:
+    grams = _wordpieces_to_grams(tokens)
+  else:
+    # Here we consider each token to be a word to allow for sub-word masking.
+    if max_ngram_size:
+      raise ValueError("cannot use ngram masking without whole word masking")
+    grams = [_Gram(i, i+1) for i in range(0, len(tokens))
+             if tokens[i] not in ["[CLS]", "[SEP]"]]
 
   num_to_predict = min(max_predictions_per_seq,
                        max(1, int(round(len(tokens) * masked_lm_prob))))
-
+  # Generate masks. If `max_ngram_size` is in [0, None] it means we're doing
+  # whole word masking or token level masking. Both of these can be treated
+  # as the `max_ngram_size=1` case.
+ masked_grams = _masking_ngrams(grams, max_ngram_size or 1, + num_to_predict, rng) masked_lms = [] - covered_indexes = set() - for index_set in cand_indexes: - if len(masked_lms) >= num_to_predict: - break - # If adding a whole-word mask would exceed the maximum number of - # predictions, then just skip this candidate. - if len(masked_lms) + len(index_set) > num_to_predict: - continue - is_any_index_covered = False - for index in index_set: - if index in covered_indexes: - is_any_index_covered = True - break - if is_any_index_covered: - continue - for index in index_set: - covered_indexes.add(index) - - masked_token = None - # 80% of the time, replace with [MASK] - if rng.random() < 0.8: - masked_token = "[MASK]" + output_tokens = list(tokens) + for gram in masked_grams: + # 80% of the time, replace all n-gram tokens with [MASK] + if rng.random() < 0.8: + replacement_action = lambda idx: "[MASK]" + else: + # 10% of the time, keep all the original n-gram tokens. + if rng.random() < 0.5: + replacement_action = lambda idx: tokens[idx] + # 10% of the time, replace each n-gram token with a random word. 
else: - # 10% of the time, keep original - if rng.random() < 0.5: - masked_token = tokens[index] - # 10% of the time, replace with random word - else: - masked_token = vocab_words[rng.randint(0, len(vocab_words) - 1)] + replacement_action = lambda idx: rng.choice(vocab_words) - output_tokens[index] = masked_token + for idx in range(gram.begin, gram.end): + output_tokens[idx] = replacement_action(idx) + masked_lms.append(MaskedLmInstance(index=idx, label=tokens[idx])) - masked_lms.append(MaskedLmInstance(index=index, label=tokens[index])) assert len(masked_lms) <= num_to_predict masked_lms = sorted(masked_lms, key=lambda x: x.index) @@ -467,7 +642,7 @@ def main(_): instances = create_training_instances( input_files, tokenizer, FLAGS.max_seq_length, FLAGS.dupe_factor, FLAGS.short_seq_prob, FLAGS.masked_lm_prob, FLAGS.max_predictions_per_seq, - rng, FLAGS.do_whole_word_mask) + rng, FLAGS.do_whole_word_mask, FLAGS.max_ngram_size) output_files = FLAGS.output_file.split(",") logging.info("*** Writing to output files ***") diff --git a/official/nlp/data/data_loader_factory.py b/official/nlp/data/data_loader_factory.py new file mode 100644 index 0000000000000000000000000000000000000000..a88caea67fe93f4b5166bb8bcf97841082fdd449 --- /dev/null +++ b/official/nlp/data/data_loader_factory.py @@ -0,0 +1,59 @@ +# Lint as: python3 +# Copyright 2020 The TensorFlow Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. 
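The n-gram masking added above samples a mask length with Zipf weights (1, 1/2, ..., 1/n), which is why the patch precomputes `itertools.accumulate(...)`: `random.Random.choices` accepts cumulative weights directly. A standalone sketch of just that sampling step (`pick_ngram_sizes` is an illustrative name, not part of the patch):

```python
import itertools
import random


def pick_ngram_sizes(max_ngram_size, num_draws, rng):
  """Samples n-gram lengths with Zipf weights 1, 1/2, ..., 1/max_ngram_size."""
  # Cumulative weights, in the form expected by choices(cum_weights=...).
  cum = list(itertools.accumulate(1. / n for n in range(1, max_ngram_size + 1)))
  return [rng.choices(range(1, max_ngram_size + 1), cum_weights=cum)[0]
          for _ in range(num_draws)]


rng = random.Random(12345)
sizes = pick_ngram_sizes(3, 1000, rng)
# With weights 1 : 1/2 : 1/3, expect roughly 55% / 27% / 18% of draws.
print({s: sizes.count(s) for s in (1, 2, 3)})
```

Using the seeded `random.Random` instance (rather than the module-level functions) keeps the draw reproducible, matching how `rng` is threaded through `create_training_instances`.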
+# ============================================================================== +"""A global factory to access NLP registered data loaders.""" + +from official.utils import registry + +_REGISTERED_DATA_LOADER_CLS = {} + + +def register_data_loader_cls(data_config_cls): + """Decorates a factory of DataLoader for lookup by a subclass of DataConfig. + + This decorator supports registration of data loaders as follows: + + ``` + @dataclasses.dataclass + class MyDataConfig(DataConfig): + # Add fields here. + pass + + @register_data_loader_cls(MyDataConfig) + class MyDataLoader: + # Inherits def __init__(self, data_config). + pass + + my_data_config = MyDataConfig() + + # Returns MyDataLoader(my_data_config). + my_loader = get_data_loader(my_data_config) + ``` + + Args: + data_config_cls: a subclass of DataConfig (*not* an instance + of DataConfig). + + Returns: + A callable for use as class decorator that registers the decorated class + for creation from an instance of data_config_cls. + """ + return registry.register(_REGISTERED_DATA_LOADER_CLS, data_config_cls) + + +def get_data_loader(data_config): + """Creates a data_loader from data_config.""" + return registry.lookup(_REGISTERED_DATA_LOADER_CLS, data_config.__class__)( + data_config) diff --git a/official/nlp/data/pretrain_dataloader.py b/official/nlp/data/pretrain_dataloader.py index 18325090caa6d83e68b4077aac4a27ee69bea938..985a7a5cc6c3f2e8a811d4fafbe6c731a1033f20 100644 --- a/official/nlp/data/pretrain_dataloader.py +++ b/official/nlp/data/pretrain_dataloader.py @@ -16,11 +16,27 @@ """Loads dataset for the BERT pretraining task.""" from typing import Mapping, Optional +import dataclasses import tensorflow as tf from official.core import input_reader +from official.modeling.hyperparams import config_definitions as cfg +from official.nlp.data import data_loader_factory +@dataclasses.dataclass +class BertPretrainDataConfig(cfg.DataConfig): + """Data config for BERT pretraining task (tasks/masked_lm).""" + 
input_path: str = '' + global_batch_size: int = 512 + is_training: bool = True + seq_length: int = 512 + max_predictions_per_seq: int = 76 + use_next_sentence_label: bool = True + use_position_id: bool = False + + +@data_loader_factory.register_data_loader_cls(BertPretrainDataConfig) class BertPretrainDataLoader: """A class to load dataset for bert pretraining task.""" @@ -91,7 +107,5 @@ class BertPretrainDataLoader: def load(self, input_context: Optional[tf.distribute.InputContext] = None): """Returns a tf.dataset.Dataset.""" reader = input_reader.InputReader( - params=self._params, - decoder_fn=self._decode, - parser_fn=self._parse) + params=self._params, decoder_fn=self._decode, parser_fn=self._parse) return reader.read(input_context) diff --git a/official/nlp/data/question_answering_dataloader.py b/official/nlp/data/question_answering_dataloader.py new file mode 100644 index 0000000000000000000000000000000000000000..08c7047e4afd80999899c34f2c5855ad2ef18634 --- /dev/null +++ b/official/nlp/data/question_answering_dataloader.py @@ -0,0 +1,95 @@ +# Lint as: python3 +# Copyright 2020 The TensorFlow Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. 
+# ==============================================================================
+"""Loads dataset for the question answering (e.g., SQuAD) task."""
+from typing import Mapping, Optional
+import dataclasses
+import tensorflow as tf
+
+from official.core import input_reader
+from official.modeling.hyperparams import config_definitions as cfg
+from official.nlp.data import data_loader_factory
+
+
+@dataclasses.dataclass
+class QADataConfig(cfg.DataConfig):
+  """Data config for question answering task (tasks/question_answering)."""
+  input_path: str = ''
+  global_batch_size: int = 48
+  is_training: bool = True
+  seq_length: int = 384
+  # Settings below are question answering specific.
+  version_2_with_negative: bool = False
+  # Settings below are only used for eval mode.
+  input_preprocessed_data_path: str = ''
+  doc_stride: int = 128
+  query_length: int = 64
+  vocab_file: str = ''
+  tokenization: str = 'WordPiece'  # WordPiece or SentencePiece
+  do_lower_case: bool = True
+
+
+@data_loader_factory.register_data_loader_cls(QADataConfig)
+class QuestionAnsweringDataLoader:
+  """A class to load dataset for the question answering (e.g., SQuAD) task."""
+
+  def __init__(self, params):
+    self._params = params
+    self._seq_length = params.seq_length
+    self._is_training = params.is_training
+
+  def _decode(self, record: tf.Tensor):
+    """Decodes a serialized tf.Example."""
+    name_to_features = {
+        'input_ids': tf.io.FixedLenFeature([self._seq_length], tf.int64),
+        'input_mask': tf.io.FixedLenFeature([self._seq_length], tf.int64),
+        'segment_ids': tf.io.FixedLenFeature([self._seq_length], tf.int64),
+    }
+    if self._is_training:
+      name_to_features['start_positions'] = tf.io.FixedLenFeature([], tf.int64)
+      name_to_features['end_positions'] = tf.io.FixedLenFeature([], tf.int64)
+    else:
+      name_to_features['unique_ids'] = tf.io.FixedLenFeature([], tf.int64)
+    example = tf.io.parse_single_example(record, name_to_features)
+
+    # tf.Example only supports tf.int64, but the TPU only supports tf.int32.
+    # So cast all int64 to int32.
+    for name in example:
+      t = example[name]
+      if t.dtype == tf.int64:
+        t = tf.cast(t, tf.int32)
+      example[name] = t
+
+    return example
+
+  def _parse(self, record: Mapping[str, tf.Tensor]):
+    """Parses raw tensors into a dict of tensors to be consumed by the model."""
+    x, y = {}, {}
+    for name, tensor in record.items():
+      if name in ('start_positions', 'end_positions'):
+        y[name] = tensor
+      elif name == 'input_ids':
+        x['input_word_ids'] = tensor
+      elif name == 'segment_ids':
+        x['input_type_ids'] = tensor
+      else:
+        x[name] = tensor
+    return (x, y)
+
+  def load(self, input_context: Optional[tf.distribute.InputContext] = None):
+    """Returns a tf.data.Dataset."""
+    reader = input_reader.InputReader(
+        params=self._params, decoder_fn=self._decode, parser_fn=self._parse)
+    return reader.read(input_context)
diff --git a/official/nlp/data/sentence_prediction_dataloader.py b/official/nlp/data/sentence_prediction_dataloader.py
index 60dd788403725aeeca2028b237c3330bbf22716c..57c068c8654ae363dcc50b081cac69d8cdb2536c 100644
--- a/official/nlp/data/sentence_prediction_dataloader.py
+++ b/official/nlp/data/sentence_prediction_dataloader.py
@@ -15,11 +15,28 @@
 # ==============================================================================
 """Loads dataset for the sentence prediction (classification) task."""
 from typing import Mapping, Optional
+import dataclasses
 
 import tensorflow as tf
 
 from official.core import input_reader
+from official.modeling.hyperparams import config_definitions as cfg
+from official.nlp.data import data_loader_factory
 
+LABEL_TYPES_MAP = {'int': tf.int64, 'float': tf.float32}
+
+
+@dataclasses.dataclass
+class SentencePredictionDataConfig(cfg.DataConfig):
+  """Data config for sentence prediction task (tasks/sentence_prediction)."""
+  input_path: str = ''
+  global_batch_size: int = 32
+  is_training: bool = True
+  seq_length: int = 128
+  label_type: str = 'int'
+
+
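The `data_loader_factory` introduced earlier in this patch pairs each `DataConfig` subclass with one loader class, so `get_data_loader(config)` can dispatch on the config's type. The real code delegates to `official.utils.registry`; the following is a self-contained sketch of the same pattern, with all names (`register_loader`, `MyDataConfig`, etc.) being illustrative:

```python
# Maps config class -> loader class; a stand-in for the factory's registry.
_LOADERS = {}


def register_loader(config_cls):
  """Decorator mapping a config class to the loader class that consumes it."""

  def decorator(loader_cls):
    _LOADERS[config_cls] = loader_cls
    return loader_cls

  return decorator


def get_loader(config):
  """Instantiates the loader registered for type(config)."""
  return _LOADERS[type(config)](config)


class MyDataConfig:
  seq_length = 128


@register_loader(MyDataConfig)
class MyDataLoader:

  def __init__(self, config):
    self.config = config


loader = get_loader(MyDataConfig())
print(type(loader).__name__)  # MyDataLoader
```

This is why each dataloader file in the patch defines a dataclass config next to its loader and decorates the loader with `register_data_loader_cls(...)`: construction is driven entirely by the config object's type.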
+@data_loader_factory.register_data_loader_cls(SentencePredictionDataConfig)
 class SentencePredictionDataLoader:
   """A class to load dataset for sentence prediction (classification) task."""
@@ -29,11 +46,12 @@ class SentencePredictionDataLoader:
 
   def _decode(self, record: tf.Tensor):
     """Decodes a serialized tf.Example."""
+    label_type = LABEL_TYPES_MAP[self._params.label_type]
     name_to_features = {
         'input_ids': tf.io.FixedLenFeature([self._seq_length], tf.int64),
         'input_mask': tf.io.FixedLenFeature([self._seq_length], tf.int64),
         'segment_ids': tf.io.FixedLenFeature([self._seq_length], tf.int64),
-        'label_ids': tf.io.FixedLenFeature([], tf.int64),
+        'label_ids': tf.io.FixedLenFeature([], label_type),
     }
     example = tf.io.parse_single_example(record, name_to_features)
diff --git a/official/nlp/data/sentence_retrieval_lib.py b/official/nlp/data/sentence_retrieval_lib.py
new file mode 100644
index 0000000000000000000000000000000000000000..d8e83ae579f8221b93e790ea62b91c3d6d2b9e90
--- /dev/null
+++ b/official/nlp/data/sentence_retrieval_lib.py
@@ -0,0 +1,168 @@
+# Copyright 2020 The TensorFlow Authors. All Rights Reserved.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+#     http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+# ==============================================================================
+"""BERT library to process data for the cross-lingual sentence retrieval task."""
+
+import os
+
+from absl import logging
+from official.nlp.bert import tokenization
+from official.nlp.data import classifier_data_lib
+
+
+class BuccProcessor(classifier_data_lib.DataProcessor):
+  """Processor for the Xtreme BUCC data set."""
+  supported_languages = ["de", "fr", "ru", "zh"]
+
+  def __init__(self,
+               process_text_fn=tokenization.convert_to_unicode):
+    super(BuccProcessor, self).__init__(process_text_fn)
+    self.languages = BuccProcessor.supported_languages
+
+  def get_dev_examples(self, data_dir, file_pattern):
+    return self._create_examples(
+        self._read_tsv(os.path.join(data_dir, file_pattern.format("dev"))),
+        "sample")
+
+  def get_test_examples(self, data_dir, file_pattern):
+    return self._create_examples(
+        self._read_tsv(os.path.join(data_dir, file_pattern.format("test"))),
+        "test")
+
+  @staticmethod
+  def get_processor_name():
+    """See base class."""
+    return "BUCC"
+
+  def _create_examples(self, lines, set_type):
+    """Creates examples for the dev and test sets."""
+    examples = []
+    for (i, line) in enumerate(lines):
+      guid = "%s-%s" % (set_type, i)
+      int_iden = int(line[0].split("-")[1])
+      text_a = self.process_text_fn(line[1])
+      examples.append(
+          classifier_data_lib.InputExample(
+              guid=guid, text_a=text_a, int_iden=int_iden))
+    return examples
+
+
+class TatoebaProcessor(classifier_data_lib.DataProcessor):
+  """Processor for the Xtreme Tatoeba data set."""
+  supported_languages = [
+      "af", "ar", "bg", "bn", "de", "el", "es", "et", "eu", "fa", "fi", "fr",
+      "he", "hi", "hu", "id", "it", "ja", "jv", "ka", "kk", "ko", "ml", "mr",
+      "nl", "pt", "ru", "sw", "ta", "te", "th", "tl", "tr", "ur", "vi", "zh"
+  ]
+
+  def __init__(self,
+               process_text_fn=tokenization.convert_to_unicode):
+    super(TatoebaProcessor, self).__init__(process_text_fn)
+    self.languages = TatoebaProcessor.supported_languages
+
+  def get_test_examples(self, data_dir, file_path):
+    return self._create_examples(
+        self._read_tsv(os.path.join(data_dir, file_path)), "test")
+
+  @staticmethod
+  def get_processor_name():
+    """See base class."""
+    return "TATOEBA"
+
+  def _create_examples(self, lines, set_type):
+    """Creates examples for the test set."""
+    examples = []
+    for (i, line) in enumerate(lines):
+      guid = "%s-%s" % (set_type, i)
+      text_a = self.process_text_fn(line[0])
+      examples.append(
+          classifier_data_lib.InputExample(
+              guid=guid, text_a=text_a, int_iden=i))
+    return examples
+
+
+def generate_sentence_retrieval_tf_record(processor,
+                                          data_dir,
+                                          tokenizer,
+                                          eval_data_output_path=None,
+                                          test_data_output_path=None,
+                                          max_seq_length=128):
+  """Generates the tf records for retrieval tasks.
+
+  Args:
+    processor: Input processor object to be used for generating data. Subclass
+      of `DataProcessor`.
+    data_dir: Directory that contains train/eval data to process. Data files
+      should follow the naming pattern expected by the processor
+      (e.g. `{lang}-en.{split}`).
+    tokenizer: The tokenizer to be applied on the data.
+    eval_data_output_path: Output to which processed tf record for evaluation
+      will be saved.
+    test_data_output_path: Output to which processed tf record for testing
+      will be saved. Must be a pattern template with {} if processor has
+      language specific test data.
+    max_seq_length: Maximum sequence length of the training/eval data to be
+      generated.
+
+  Returns:
+    A dictionary containing input meta data.
+ """ + assert eval_data_output_path or test_data_output_path + + if processor.get_processor_name() == "BUCC": + path_pattern = "{}-en.{{}}.{}" + + if processor.get_processor_name() == "TATOEBA": + path_pattern = "{}-en.{}" + + meta_data = { + "processor_type": processor.get_processor_name(), + "max_seq_length": max_seq_length, + "number_eval_data": {}, + "number_test_data": {}, + } + logging.info("Start to process %s task data", processor.get_processor_name()) + + for lang_a in processor.languages: + for lang_b in [lang_a, "en"]: + if eval_data_output_path: + eval_input_data_examples = processor.get_dev_examples( + data_dir, os.path.join(path_pattern.format(lang_a, lang_b))) + + num_eval_data = len(eval_input_data_examples) + logging.info("Processing %d dev examples of %s-en.%s", num_eval_data, + lang_a, lang_b) + output_file = os.path.join( + eval_data_output_path, + "{}-en-{}.{}.tfrecords".format(lang_a, lang_b, "dev")) + classifier_data_lib.file_based_convert_examples_to_features( + eval_input_data_examples, None, max_seq_length, tokenizer, + output_file, None) + meta_data["number_eval_data"][f"{lang_a}-en.{lang_b}"] = num_eval_data + + if test_data_output_path: + test_input_data_examples = processor.get_test_examples( + data_dir, os.path.join(path_pattern.format(lang_a, lang_b))) + + num_test_data = len(test_input_data_examples) + logging.info("Processing %d test examples of %s-en.%s", num_test_data, + lang_a, lang_b) + output_file = os.path.join( + test_data_output_path, + "{}-en-{}.{}.tfrecords".format(lang_a, lang_b, "test")) + classifier_data_lib.file_based_convert_examples_to_features( + test_input_data_examples, None, max_seq_length, tokenizer, + output_file, None) + meta_data["number_test_data"][f"{lang_a}-en.{lang_b}"] = num_test_data + + return meta_data diff --git a/official/nlp/data/tagging_data_lib.py b/official/nlp/data/tagging_data_lib.py new file mode 100644 index 0000000000000000000000000000000000000000..c97fd9382f493209f61b0672c04b544259164372 
--- /dev/null
+++ b/official/nlp/data/tagging_data_lib.py
@@ -0,0 +1,346 @@
+# Copyright 2020 The TensorFlow Authors. All Rights Reserved.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+#     http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+# ==============================================================================
+"""Library to process data for tagging tasks such as NER/POS."""
+import collections
+import os
+
+from absl import logging
+import tensorflow as tf
+
+from official.nlp.data import classifier_data_lib
+
+# A negative label id for the padding label, which will not contribute
+# to loss/metrics in training.
+_PADDING_LABEL_ID = -1
+
+# The special unknown token, used to substitute a word which has too many
+# subwords after tokenization.
+_UNK_TOKEN = "[UNK]"
+
+
+class InputExample(object):
+  """A single training/test example for token classification."""
+
+  def __init__(self, sentence_id, words=None, label_ids=None):
+    """Constructs an InputExample."""
+    self.sentence_id = sentence_id
+    self.words = words if words else []
+    self.label_ids = label_ids if label_ids else []
+
+  def add_word_and_label_id(self, word, label_id):
+    """Adds a word and label_id pair to the example."""
+    self.words.append(word)
+    self.label_ids.append(label_id)
+
+
+def _read_one_file(file_name, label_list):
+  """Reads one file and returns a list of `InputExample` instances."""
+  lines = tf.io.gfile.GFile(file_name, "r").readlines()
+  examples = []
+  label_id_map = {label: i for i, label in enumerate(label_list)}
+  sentence_id = 0
+  example = InputExample(sentence_id=0)
+  for line in lines:
+    line = line.strip("\n")
+    if line:
+      # The format is: <token>\t<label>