# Copyright 2021 The TensorFlow Authors. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
#     http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

# Layers

Layers are the fundamental building blocks for NLP models. They can be used to
assemble new `tf.keras` layers or models.
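
For example, the sketch below shows the assumed import path (the `official`
package from the TensorFlow Model Garden, e.g. installed via
`pip install tf-models-official`) and that these layers are ordinary `tf.keras`
layers:

```python
import tensorflow as tf

# Assumed import path for the Model Garden NLP layers.
from official.nlp.modeling import layers

# Each layer in this directory is a standard tf.keras layer, so it can be
# combined freely with built-in Keras layers when assembling new models.
assert issubclass(layers.MultiHeadAttention, tf.keras.layers.Layer)
```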

*   [MultiHeadAttention](attention.py) implements optionally masked attention
    between query, key, and value tensors as described in
    ["Attention Is All You Need"](https://arxiv.org/abs/1706.03762). If
    `from_tensor` and `to_tensor` are the same, then this is self-attention.
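
    A minimal self-attention sketch, assuming the layer follows the
    `tf.keras.layers.MultiHeadAttention`-style call signature (`query`, `value`,
    and an optional `attention_mask`); the shapes are illustrative:

    ```python
    import tensorflow as tf

    from official.nlp.modeling import layers

    attention = layers.MultiHeadAttention(num_heads=8, key_dim=64)

    x = tf.random.normal([2, 16, 512])           # [batch, seq_len, hidden]
    mask = tf.ones([2, 16, 16], dtype=tf.int32)  # 1 = attend, 0 = mask out

    # Passing the same tensor as query and value makes this self-attention.
    output = attention(query=x, value=x, attention_mask=mask)
    print(output.shape)  # (2, 16, 512)
    ```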

*   [BigBirdAttention](bigbird_attention.py) implements a sparse attention
    mechanism that reduces the quadratic dependency on sequence length to
    linear, as described in
    ["Big Bird: Transformers for Longer Sequences"](https://arxiv.org/abs/2007.14062).

*   [CachedAttention](attention.py) implements an attention layer with a cache
    used for auto-regressive decoding.

*   [KernelAttention](kernel_attention.py) implements a group of attention
    mechanisms that express self-attention as a linear dot-product of kernel
    feature maps and make use of the associativity property of matrix products
    to reduce the complexity from quadratic to linear. The implementation
    includes methods described in ["Transformers are RNNs: Fast Autoregressive
    Transformers with Linear Attention"](https://arxiv.org/abs/2006.16236),
    ["Rethinking Attention with Performers"](https://arxiv.org/abs/2009.14794), and
    ["Random Feature Attention"](https://openreview.net/pdf?id=QtTKTdVrFBB).

*   [MatMulWithMargin](mat_mul_with_margin.py) implements a matrix
    multiplication with margin layer used for training retrieval / ranking
    tasks, as described in ["Improving Multilingual Sentence Embedding using
    Bi-directional Dual Encoder with Additive Margin
    Softmax"](https://www.ijcai.org/Proceedings/2019/0746.pdf).

*   [MultiChannelAttention](multi_channel_attention.py) implements a variant of
    multi-head attention which can be used to merge multiple streams for
    cross-attention.

*   [TalkingHeadsAttention](talking_heads_attention.py) implements talking-heads
    attention, as described in
    ["Talking-Heads Attention"](https://arxiv.org/abs/2003.02436).

*   [Transformer](transformer.py) implements an optionally masked transformer as
    described in
    ["Attention Is All You Need"](https://arxiv.org/abs/1706.03762).
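
    A sketch of a single encoder block; the constructor arguments
    `num_attention_heads`, `intermediate_size`, and `intermediate_activation`
    and the `[data, mask]` call convention are assumptions, not a definitive
    API reference:

    ```python
    import tensorflow as tf

    from official.nlp.modeling import layers

    block = layers.Transformer(
        num_attention_heads=8,
        intermediate_size=2048,
        intermediate_activation='relu')

    data = tf.random.normal([2, 16, 512])  # [batch, seq_len, hidden]
    mask = tf.ones([2, 16, 16])            # [batch, from_seq_len, to_seq_len]

    output = block([data, mask])           # same shape as `data`
    ```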

*   [TransformerDecoderBlock](transformer.py) implements a single transformer
    decoder block, made up of self multi-head attention, cross multi-head
    attention, and a feedforward network.

*   [RandomFeatureGaussianProcess](gaussian_process.py) implements a random
    feature-based Gaussian process as described in ["Random Features for
    Large-Scale Kernel Machines"](https://people.eecs.berkeley.edu/~brecht/papers/07.rah.rec.nips.pdf).

*   [ReZeroTransformer](rezero_transformer.py) implements a Transformer with
    ReZero as described in
    ["ReZero is All You Need: Fast Convergence at Large Depth"](https://arxiv.org/abs/2003.04887).

*   [OnDeviceEmbedding](on_device_embedding.py) implements efficient embedding
    lookups designed for TPU-based models.
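
    A minimal sketch, assuming a constructor that takes `vocab_size` and
    `embedding_width` and a call that maps integer ids to embedding vectors:

    ```python
    import tensorflow as tf

    from official.nlp.modeling import layers

    embedding = layers.OnDeviceEmbedding(vocab_size=30522, embedding_width=128)

    word_ids = tf.constant([[101, 2023, 2003, 102]])  # [batch, seq_len]
    embedded = embedding(word_ids)                    # [batch, seq_len, 128]
    ```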

*   [PositionalEmbedding](position_embedding.py) creates a positional embedding
    as described in ["BERT: Pre-training of Deep Bidirectional Transformers for
    Language Understanding"](https://arxiv.org/abs/1810.04805).

*   [SelfAttentionMask](self_attention_mask.py) creates a 3D attention mask from
    a 2D tensor mask.
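
    A sketch of the mask expansion; calling the layer with the embedded inputs
    and the 2D padding mask as a two-element list is an assumption about the
    call convention:

    ```python
    import tensorflow as tf

    from official.nlp.modeling import layers

    embeddings = tf.random.normal([2, 4, 16])   # [batch, seq_len, hidden]
    padding_mask = tf.constant([[1, 1, 1, 0],
                                [1, 1, 0, 0]])  # [batch, seq_len]

    # Expands the 2D padding mask to a [batch, seq_len, seq_len] attention mask.
    attention_mask = layers.SelfAttentionMask()([embeddings, padding_mask])
    ```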

*   [SpectralNormalization](spectral_normalization.py) implements a Keras layer
    wrapper that applies spectral normalization regularization to the wrapped
    layer, as described in ["Spectral Norm Regularization for Improving the
    Generalizability of Deep Learning"](https://arxiv.org/abs/1705.10941).

*   [MaskedSoftmax](masked_softmax.py) implements a softmax with an optional
    masking input. If no mask is provided to this layer, it performs a standard
    softmax; however, if a mask tensor is applied (which should be 1 in
    positions where the data should be allowed through, and 0 where the data
    should be masked), the output will have masked positions set to
    approximately zero.
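
    A small sketch of the masking behavior; the call convention (scores followed
    by an optional mask tensor) is an assumption:

    ```python
    import tensorflow as tf

    from official.nlp.modeling import layers

    scores = tf.constant([[1.0, 2.0, 3.0, 4.0]])
    mask = tf.constant([[1.0, 1.0, 1.0, 0.0]])  # 1 = keep, 0 = mask out

    # The masked (last) position comes out approximately zero; the remaining
    # probabilities sum to ~1.
    probabilities = layers.MaskedSoftmax()(scores, mask)
    ```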

*   [`MaskedLM`](masked_lm.py) implements a masked language model. It assumes
    the embedding table variable is passed to it.
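
    A minimal sketch, assuming the layer receives the embedding table at
    construction and is called with the encoder's sequence output and the
    positions of the masked tokens:

    ```python
    import tensorflow as tf

    from official.nlp.modeling import layers

    vocab_size, hidden_size = 1000, 64
    embedding_table = tf.Variable(tf.random.normal([vocab_size, hidden_size]))

    masked_lm = layers.MaskedLM(embedding_table=embedding_table)

    sequence_output = tf.random.normal([2, 16, hidden_size])  # encoder outputs
    masked_positions = tf.constant([[1, 5, 7], [2, 3, 9]])    # [batch, num_masked]

    # Gathers the masked positions and projects them back to vocabulary logits.
    logits = masked_lm(sequence_output, masked_positions)
    ```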

*   [ClassificationHead](cls_head.py) implements a pooling head over a sequence
    of embeddings, commonly used by classification tasks.
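
    A minimal sketch, assuming constructor arguments `inner_dim` and
    `num_classes` and a call that pools the sequence of embeddings into
    per-example logits:

    ```python
    import tensorflow as tf

    from official.nlp.modeling import layers

    head = layers.ClassificationHead(inner_dim=64, num_classes=3)

    sequence_output = tf.random.normal([2, 16, 64])  # [batch, seq_len, hidden]
    logits = head(sequence_output)                   # [batch, num_classes]
    ```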

*   [GaussianProcessClassificationHead](cls_head.py) implements a
    spectral-normalized neural Gaussian process (SNGP)-based classification head
    as described in ["Simple and Principled Uncertainty Estimation with
    Deterministic Deep Learning via Distance Awareness"](https://arxiv.org/abs/2006.10108).

*   [GatedFeedforward](gated_feedforward.py) implements a gated linear
    feedforward layer as described in
    ["GLU Variants Improve Transformer"](https://arxiv.org/abs/2002.05202).

*   [MultiHeadRelativeAttention](relative_attention.py) implements a variant
    of multi-head attention with support for relative position encodings as
    described in ["Transformer-XL: Attentive Language Models Beyond a
    Fixed-Length Context"](https://arxiv.org/abs/1901.02860). This also has
    extended support for segment-based attention, a re-parameterization
    introduced in ["XLNet: Generalized Autoregressive Pretraining for Language
    Understanding"](https://arxiv.org/abs/1906.08237).

*   [TwoStreamRelativeAttention](relative_attention.py) implements a variant
    of multi-head relative attention as described in ["XLNet: Generalized
    Autoregressive Pretraining for Language
    Understanding"](https://arxiv.org/abs/1906.08237). This takes in a query
    stream and a content stream and applies self-attention.

*   [TransformerXL](transformer_xl.py) implements Transformer-XL as introduced in
    ["Transformer-XL: Attentive Language Models Beyond a Fixed-Length
    Context"](https://arxiv.org/abs/1901.02860). This contains
    `TransformerXLBlock`, a block containing either one- or two-stream relative
    self-attention as well as subsequent feedforward networks. It also contains
    `TransformerXL`, which contains the relative attention biases as well as
    multiple `TransformerXLBlock` layers.

*   [MobileBertEmbedding](mobile_bert_layers.py) and
    [MobileBertTransformer](mobile_bert_layers.py) implement the embedding layer
    and the transformer layer proposed in the
    [MobileBERT paper](https://arxiv.org/pdf/2004.02984.pdf).
*   [BertTokenizer](text_layers.py), [SentencepieceTokenizer](text_layers.py),
    and [BertPackInputs](text_layers.py) implement layers that tokenize raw text
    and pack it into the inputs expected by BERT models.
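
    A sketch of the tokenize-then-pack flow; the placeholder vocabulary path and
    the use of `get_special_tokens_dict()` to configure the packer are
    assumptions about typical usage:

    ```python
    import tensorflow as tf

    from official.nlp.modeling import layers

    # 'vocab.txt' is a placeholder path to a BERT WordPiece vocabulary file.
    tokenize = layers.BertTokenizer(vocab_file='vocab.txt', lower_case=True)
    pack = layers.BertPackInputs(
        seq_length=128,
        special_tokens_dict=tokenize.get_special_tokens_dict())

    sentences = tf.constant(['hello world', 'layers are building blocks'])
    tokens = tokenize(sentences)     # RaggedTensor of wordpiece ids
    encoder_inputs = pack([tokens])  # input_word_ids, input_mask, input_type_ids
    ```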