Commit ab6d40ca authored by Jing Li, committed by A. Unique TensorFlower

Add README files in nlp/modeling folder.

PiperOrigin-RevId: 299126470
parent cf01596c
# TensorFlow Natural Language Processing Modelling Toolkit

tensorflow/models/official/nlp provides a [modeling library](modeling) for
constructing NLP model architectures, as well as TF2 reference implementations
for state-of-the-art models.

The repository contains the following models, with implementations, pre-trained
model weights, usage scripts, and conversion utilities:
* [BERT](bert)
* [ALBERT](albert)
* [XLNet](xlnet)
* [Transformer for translation](transformer)
Additional features:
* Distributed training on both multi-GPU and TPU.
* End-to-end training for custom models, including both pre-training and
  fine-tuning.
# NLP Modeling Library
This library provides a set of Keras primitives (Layers, Networks, and Models)
that can be assembled into transformer-based models. They are
flexible, validated, interoperable, and both TF1 and TF2 compatible.
* [`layers`](layers) are the fundamental building blocks for NLP models.
They can be used to assemble new layers, networks, or models.
* [`networks`](networks) are combinations of layers (and possibly other networks). They are sub-units of models that would not be trained alone. They
encapsulate common network structures like a classification head
or a transformer encoder into an easily handled object with a
standardized configuration.
* [`models`](models) are combinations of layers and networks that would be trained. Pre-built canned models are provided as both convenience functions and canonical examples.
* [`losses`](losses) contains common loss computation used in NLP tasks.
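For orientation, here is a minimal sketch of how these four pieces are typically
imported together. The `official.nlp.modeling` import path is an assumption based
on this repository's layout (tensorflow/models/official/nlp/modeling); adjust it
to your checkout.

```python
# Assumed import path, based on the repository layout; adjust as needed.
from official.nlp.modeling import layers    # building blocks (attention, transformer, ...)
from official.nlp.modeling import losses    # common loss computations
from official.nlp.modeling import models    # trainable, pre-built canned models
from official.nlp.modeling import networks  # reusable sub-units (encoders, heads, ...)
```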
Besides the pre-defined primitives, the library also provides scaffold classes
to allow easy experimentation with novel architectures; for example, you don't
need to fork a whole Transformer object to try a different kind of attention
primitive. A usage sketch follows the list below.
* [`TransformerScaffold`](layers/transformer_scaffold.py) implements the
Transformer from ["Attention Is All You Need"](https://arxiv.org/abs/1706.03762),
with a customizable attention layer option. Users can pass a class to
`attention_cls` and the associated config to `attention_cfg`, in which case the
scaffold will instantiate the class with the config, or pass a class instance to
`attention_cls`.
* [`EncoderScaffold`](networks/encoder_scaffold.py) implements the transformer
encoder from ["BERT: Pre-training of Deep Bidirectional Transformers for
Language Understanding"](https://arxiv.org/abs/1810.04805), with a customizable
embedding subnetwork (which will replace the standard embedding logic) and/or a
custom hidden layer (which will replace the Transformer instantiation in the
encoder).
BERT and ALBERT models in this repo are implemented using this library. Code examples can be found in the corresponding model folder.
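Below is a minimal sketch of the `TransformerScaffold` usage pattern described
above. Only `attention_cls` and `attention_cfg` are taken from that description;
the remaining constructor arguments (`num_attention_heads`, `intermediate_size`,
`intermediate_activation`) and the shape of the custom attention class are
assumptions, so check `layers/transformer_scaffold.py` and `layers/attention.py`
for the exact interfaces.

```python
import tensorflow as tf
from official.nlp.modeling import layers


class MyAttention(tf.keras.layers.Layer):
  """Hypothetical custom attention layer.

  A real implementation must accept the keys passed via `attention_cfg` and
  follow the same call interface as the library's Attention layer
  (see layers/attention.py).
  """

  def __init__(self, num_heads, **kwargs):
    super().__init__(**kwargs)
    self.num_heads = num_heads
  # call(...) omitted; mirror layers/attention.py.


# Option 1: pass the class and its config; the scaffold instantiates it.
block = layers.TransformerScaffold(
    num_attention_heads=8,           # assumed argument names
    intermediate_size=2048,
    intermediate_activation='relu',
    attention_cls=MyAttention,
    attention_cfg={'num_heads': 8})  # hypothetical config consumed by MyAttention

# Option 2: pass an already-constructed instance instead of a class.
block = layers.TransformerScaffold(
    num_attention_heads=8,
    intermediate_size=2048,
    intermediate_activation='relu',
    attention_cls=MyAttention(num_heads=8))
```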
# Layers
Layers are the fundamental building blocks for NLP models. They can be used to
assemble new layers, networks, or models.
* [DenseEinsum](dense_einsum.py) implements a feedforward network using tf.einsum. This layer contains the einsum op, the associated weight, and the
logic required to generate the einsum expression for the given initialization
parameters.
* [Attention](attention.py) implements an optionally masked attention between two tensors, `from_tensor` and `to_tensor`, as described in ["Attention Is All You Need"](https://arxiv.org/abs/1706.03762). If `from_tensor` and `to_tensor` are the same, then this is self-attention.
* [CachedAttention](attention.py) implements an attention layer with a cache
used for autoregressive decoding.
* [Transformer](transformer.py) implements an optionally masked transformer as
described in ["Attention Is All You Need"](https://arxiv.org/abs/1706.03762).
* [OnDeviceEmbedding](on_device_embedding.py) implements efficient embedding lookups designed for TPU-based models.
* [PositionalEmbedding](position_embedding.py) creates a positional embedding
as described in ["BERT: Pre-training of Deep Bidirectional Transformers for
Language Understanding"](https://arxiv.org/abs/1810.04805).
* [SelfAttentionMask](self_attention_mask.py) creates a 3D attention mask from a 2D tensor mask.
* [MaskedSoftmax](masked_softmax.py) implements a softmax with an optional masking input. If no mask is provided to this layer, it performs a standard softmax; however, if a mask tensor is applied (which should be 1 in positions where the data should be allowed through, and 0 where the data should be masked), the output will have masked positions set to approximately zero.
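The masking convention described for `MaskedSoftmax` and `SelfAttentionMask` can
be illustrated with plain TensorFlow ops. This is a conceptual sketch of the
behavior only, not the layers' actual implementation or call signatures.

```python
import tensorflow as tf

# MaskedSoftmax-style behavior: mask is 1 where data may pass through
# and 0 where it should be masked.
scores = tf.constant([[2.0, 1.0, 0.1, -1.0]])
mask = tf.constant([[1.0, 1.0, 1.0, 0.0]])  # last position is padding

# Adding a large negative number to masked positions drives their softmax
# output to approximately zero; the rest renormalize among themselves.
adder = (1.0 - mask) * -1e9
probs = tf.nn.softmax(scores + adder)

# SelfAttentionMask-style expansion: broadcast a 2D padding mask
# [batch, to_seq_len] into a 3D attention mask [batch, from_seq_len, to_seq_len].
to_mask = tf.constant([[1.0, 1.0, 1.0, 0.0]])  # [batch=1, to_seq_len=4]
from_seq_len = 4
attention_mask = tf.tile(to_mask[:, tf.newaxis, :], [1, from_seq_len, 1])
```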
# Losses
Losses contains common loss computation used in NLP tasks.
* `weighted_sparse_categorical_crossentropy_loss` computes per-batch sparse
categorical crossentropy loss.
* `weighted_sparse_categorical_crossentropy_per_example_loss` computes
per-example sparse categorical crossentropy loss.
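The distinction between the two functions can be illustrated with standard
TensorFlow ops. This is a conceptual sketch of weighted per-example versus
per-batch sparse categorical crossentropy; the library functions' actual
signatures may differ.

```python
import tensorflow as tf

labels = tf.constant([[2, 1], [0, 1]])            # [batch, seq_len] integer ids
logits = tf.random.normal([2, 2, 5])              # [batch, seq_len, vocab]
weights = tf.constant([[1.0, 1.0], [1.0, 0.0]])   # 0 for padded positions

# Per-example (per-position) sparse categorical crossentropy, weighted.
per_example = tf.keras.losses.sparse_categorical_crossentropy(
    labels, logits, from_logits=True) * weights   # [batch, seq_len]

# Per-batch loss: weighted mean over all non-padded positions.
per_batch = tf.reduce_sum(per_example) / (tf.reduce_sum(weights) + 1e-5)
```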
# Models
Models are combinations of layers and networks that would be trained.
Several pre-built canned models are provided to train encoder networks. These
models are intended as both convenience functions and canonical examples.
* [`BertClassifier`](bert_classifier.py) implements a simple classification
model containing a single classification head using the Classification network.
* [`BertSpanLabeler`](bert_span_labeler.py) implements a simple single-span
start-end predictor (that is, a model that predicts two values: a start token
index and an end token index), suitable for SQuAD-style tasks.
* [`BertPretrainer`](bert_pretrainer.py) implements a masked LM and a
classification head using the Masked LM and Classification networks,
respectively.
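A minimal usage sketch for the canned models, assuming an encoder network such
as the `TransformerEncoder` described in the Networks section below. The
constructor argument names (`network`, `num_classes`, and the BERT-base-like
encoder hyperparameters) are assumptions; consult the model files for the exact
interfaces.

```python
from official.nlp.modeling import models, networks

# Encoder hyperparameters and argument names are assumed (BERT-base-like);
# see the Networks section below and transformer_encoder.py for details.
encoder = networks.TransformerEncoder(
    vocab_size=30522, num_layers=12, hidden_size=768, num_attention_heads=12)

# Single-head classifier for sentence-level tasks.
classifier = models.BertClassifier(network=encoder, num_classes=2)

# Single-span start/end predictor for SQuAD-style tasks.
span_labeler = models.BertSpanLabeler(network=encoder)
```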
# Networks
Networks are combinations of layers (and possibly other networks). They are
sub-units of models that would not be trained alone. They encapsulate common
network structures like a classification head or a transformer encoder into an
easily handled object with a standardized configuration.
* [`TransformerEncoder`](transformer_encoder.py) implements a bi-directional
Transformer-based encoder as described in ["BERT: Pre-training of Deep
Bidirectional Transformers for Language Understanding"](https://arxiv.org/abs/1810.04805). It includes the embedding lookups,
transformer layers and pooling layer.
* [`AlbertTransformerEncoder`](albert_transformer_encoder.py) implements a
Transformer encoder described in the paper ["ALBERT: A Lite BERT for
Self-supervised Learning of Language Representations"](https://arxiv.org/abs/1909.11942).
Compared with [BERT](https://arxiv.org/abs/1810.04805), ALBERT factorizes
embedding parameters into two smaller matrices and shares parameters across
layers.
* [`MaskedLM`](masked_lm.py) implements a masked language model for BERT pretraining. It assumes that the network being passed has a `get_embedding_table()` method.
* [`Classification`](classification.py) contains a single hidden layer, and is intended for use as a classification head.
* [`SpanLabeling`](span_labeling.py) implements a single-span labeler (that is, a prediction head that can predict one start and end index per batch item) based on a single dense hidden layer. It can be used in the SQuAD task.
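A construction sketch for `TransformerEncoder` follows. The argument names and
BERT-base-like values are assumptions, not the verified signature; see
transformer_encoder.py for the exact interface.

```python
from official.nlp.modeling import networks

# Assumed argument names, modeled on BERT-base hyperparameters.
encoder = networks.TransformerEncoder(
    vocab_size=30522,
    num_layers=12,
    hidden_size=768,
    num_attention_heads=12,
    intermediate_size=3072)

# Heads such as MaskedLM expect the network passed to them to expose
# get_embedding_table(), as noted above.
```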