# Layers

Layers are the fundamental building blocks for NLP models. They can be used to
assemble new `tf.keras` layers or models.

*   [MultiHeadAttention](attention.py) implements an optionally masked attention
    between query, key, value tensors as described in
    ["Attention Is All You Need"](https://arxiv.org/abs/1706.03762). If
    `from_tensor` and `to_tensor` are the same, then this is self-attention.
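
    A minimal self-attention sketch (assuming the layer is importable from
    `official.nlp.modeling.layers` and follows the Keras-compatible
    `MultiHeadAttention` interface; shapes are illustrative):

    ```python
    import tensorflow as tf
    from official.nlp.modeling import layers

    attention = layers.MultiHeadAttention(num_heads=8, key_dim=64)
    x = tf.random.normal([2, 16, 512])    # [batch, seq_length, width]
    # Passing the same tensor as query and value yields self-attention.
    output = attention(query=x, value=x)  # -> [2, 16, 512]
    ```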

*   [BigBirdAttention](bigbird_attention.py) implements a sparse attention
    mechanism that reduces the quadratic dependency of attention on sequence
    length to linear, as described in
    ["Big Bird: Transformers for Longer Sequences"](https://arxiv.org/abs/2007.14062).

*   [CachedAttention](attention.py) implements an attention layer with a cache
    used for auto-regressive decoding.

*   [KernelAttention](kernel_attention.py) implements a family of attention
    mechanisms that express self-attention as a linear dot-product of kernel
    feature maps and use the associativity of matrix products to reduce the
    complexity from quadratic to linear. The implementation includes methods
    described in ["Transformers are RNNs: Fast Autoregressive Transformers
    with Linear Attention"](https://arxiv.org/abs/2006.16236),
    ["Rethinking Attention with Performers"](https://arxiv.org/abs/2009.14794),
    and ["Random Feature Attention"](https://openreview.net/pdf?id=QtTKTdVrFBB).

*   [MatMulWithMargin](mat_mul_with_margin.py) implements a matrix
    multiplication with margin layer used for training retrieval / ranking
    tasks, as described in ["Improving Multilingual Sentence Embedding using
    Bi-directional Dual Encoder with Additive Margin
    Softmax"](https://www.ijcai.org/Proceedings/2019/0746.pdf).

*   [MultiChannelAttention](multi_channel_attention.py) implements a variant of
    multi-head attention which can be used to merge multiple streams for
    cross-attention.

*   [TalkingHeadsAttention](talking_heads_attention.py) implements talking-heads
    attention, as described in
    ["Talking-Heads Attention"](https://arxiv.org/abs/2003.02436).

*   [Transformer](transformer.py) implements an optionally masked transformer as
    described in
    ["Attention Is All You Need"](https://arxiv.org/abs/1706.03762).

*   [TransformerDecoderBlock](transformer.py) implements a single transformer
    decoder block made up of multi-head self-attention, multi-head
    cross-attention, and a feedforward network.

*   [RandomFeatureGaussianProcess](gaussian_process.py) implements a random
    feature-based Gaussian process as described in ["Random Features for
    Large-Scale Kernel Machines"](https://people.eecs.berkeley.edu/~brecht/papers/07.rah.rec.nips.pdf).

*   [ReuseMultiHeadAttention](reuse_attention.py) supports passing in attention
    scores to be reused, avoiding their recomputation, as described in
    ["Leveraging redundancy in attention with Reuse Transformers"](https://arxiv.org/abs/2110.06821).

*   [ReuseTransformer](reuse_transformer.py) supports reusing attention scores
    from lower layers in higher layers to avoid recomputing them, as described
    in ["Leveraging redundancy in attention with Reuse Transformers"](https://arxiv.org/abs/2110.06821).

*   [ReZeroTransformer](rezero_transformer.py) implements a Transformer with
    ReZero, as described in
    ["ReZero is All You Need: Fast Convergence at Large Depth"](https://arxiv.org/abs/2003.04887).

*   [OnDeviceEmbedding](on_device_embedding.py) implements efficient embedding
    lookups designed for TPU-based models.
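
    A minimal sketch (assuming import from `official.nlp.modeling.layers`;
    vocabulary size and width are illustrative):

    ```python
    import tensorflow as tf
    from official.nlp.modeling import layers

    embedding = layers.OnDeviceEmbedding(vocab_size=30522, embedding_width=128)
    word_ids = tf.constant([[101, 2023, 2003, 102]])  # [batch, seq_length]
    embeddings = embedding(word_ids)                  # -> [1, 4, 128]
    ```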

*   [PositionalEmbedding](position_embedding.py) creates a positional embedding
    as described in ["BERT: Pre-training of Deep Bidirectional Transformers for
    Language Understanding"](https://arxiv.org/abs/1810.04805).

*   [SelfAttentionMask](self_attention_mask.py) creates a 3D attention mask from
    a 2D tensor mask.

*   [SpectralNormalization](spectral_normalization.py) implements a `tf.keras`
    layer wrapper that applies spectral normalization regularization to a given
    layer. See ["Spectral Norm Regularization for Improving the Generalizability
    of Deep Learning"](https://arxiv.org/abs/1705.10941).

*   [MaskedSoftmax](masked_softmax.py) implements a softmax with an optional
    masking input. If no mask is provided to this layer, it performs a standard
    softmax; however, if a mask tensor is applied (which should be 1 in
    positions where the data should be allowed through, and 0 where the data
    should be masked), the output will have masked positions set to
    approximately zero.
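
    A minimal sketch (assuming the layer is importable from
    `official.nlp.modeling.layers` and takes the scores and the mask at call
    time):

    ```python
    import tensorflow as tf
    from official.nlp.modeling import layers

    masked_softmax = layers.MaskedSoftmax()
    scores = tf.random.normal([2, 4, 4])              # e.g. attention scores
    mask = tf.constant([[[1., 1., 1., 0.]] * 4] * 2)  # 1 = keep, 0 = mask out
    probs = masked_softmax(scores, mask)              # masked positions ~ 0
    ```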

*   [MaskedLM](masked_lm.py) implements a masked language model. It assumes
    the embedding table variable is passed to it.
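
    A minimal sketch (the `embedding_table` argument and the
    `(sequence_data, masked_positions)` call signature are assumptions based
    on masked_lm.py; shapes are illustrative):

    ```python
    import tensorflow as tf
    from official.nlp.modeling import layers

    vocab_size, hidden_size = 30522, 128
    embedding_table = tf.Variable(tf.random.normal([vocab_size, hidden_size]))
    masked_lm = layers.MaskedLM(embedding_table=embedding_table)

    sequence_data = tf.random.normal([2, 16, hidden_size])  # encoder outputs
    masked_positions = tf.constant([[1, 5], [2, 7]])         # positions to predict
    lm_output = masked_lm(sequence_data, masked_positions)   # -> [2, 2, vocab_size]
    ```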

*   [ClassificationHead](cls_head.py) implements a pooling head over a sequence
    of embeddings, commonly used for classification tasks.
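
    A minimal sketch (assuming the head pools the first token and takes
    `inner_dim` and `num_classes` arguments; shapes are illustrative):

    ```python
    import tensorflow as tf
    from official.nlp.modeling import layers

    cls_head = layers.ClassificationHead(inner_dim=128, num_classes=3)
    sequence_output = tf.random.normal([2, 16, 128])  # [batch, seq_length, width]
    logits = cls_head(sequence_output)                # -> [2, 3]
    ```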

*   [GaussianProcessClassificationHead](cls_head.py) implements a
    spectral-normalized neural Gaussian process (SNGP) classification head as
    described in ["Simple and Principled Uncertainty Estimation with
    Deterministic Deep Learning via Distance Awareness"](https://arxiv.org/abs/2006.10108).

*   [GatedFeedforward](gated_feedforward.py) implements a gated linear unit
    (GLU) feedforward layer as described in
    ["GLU Variants Improve Transformer"](https://arxiv.org/abs/2002.05202).

*   [MultiHeadRelativeAttention](relative_attention.py) implements a variant
    of multi-head attention with support for relative position encodings as
    described in ["Transformer-XL: Attentive Language Models Beyond a
    Fixed-Length Context"](https://arxiv.org/abs/1901.02860). This also has
    extended support for segment-based attention, a re-parameterization
    introduced in ["XLNet: Generalized Autoregressive Pretraining for Language
    Understanding"](https://arxiv.org/abs/1906.08237).

*   [TwoStreamRelativeAttention](relative_attention.py) implements a variant
    of multi-head relative attention as described in ["XLNet: Generalized
    Autoregressive Pretraining for Language
    Understanding"](https://arxiv.org/abs/1906.08237). This takes in query and
    content streams and applies self-attention.

*   [TransformerXL](transformer_xl.py) implements Transformer-XL as introduced
    in ["Transformer-XL: Attentive Language Models Beyond a Fixed-Length
    Context"](https://arxiv.org/abs/1901.02860). The file contains
    `TransformerXLBlock`, a block with either one- or two-stream relative
    self-attention followed by a feedforward network, and `TransformerXL`,
    which holds the attention biases and stacks multiple `TransformerXLBlock`
    layers.

*   [MobileBertEmbedding](mobile_bert_layers.py) and
    [MobileBertTransformer](mobile_bert_layers.py) implement the embedding
    layer and the transformer layer proposed in the
    [MobileBERT paper](https://arxiv.org/pdf/2004.02984.pdf).

*   [BertPackInputs](text_layers.py),
    [BertTokenizer](text_layers.py), and [SentencepieceTokenizer](text_layers.py)
    implement layers to tokenize raw text and pack it into the inputs expected
    by BERT models.
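
    A minimal sketch of the tokenize-and-pack flow (the vocabulary path is a
    placeholder, and the constructor arguments and `get_special_tokens_dict()`
    helper are assumptions based on text_layers.py):

    ```python
    import tensorflow as tf
    from official.nlp.modeling import layers

    tokenizer = layers.BertTokenizer(vocab_file="/path/to/vocab.txt",
                                     lower_case=True)
    packer = layers.BertPackInputs(
        seq_length=128,
        special_tokens_dict=tokenizer.get_special_tokens_dict())

    sentences = tf.constant(["hello world", "layers are building blocks"])
    token_ids = tokenizer(sentences)   # ragged token ids per sentence
    bert_inputs = packer([token_ids])  # dict: input_word_ids/input_mask/input_type_ids
    ```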
    
*   [TransformerEncoderBlock](transformer_encoder_block.py) implements
    an optionally masked transformer as described in
    ["Attention Is All You Need"](https://arxiv.org/abs/1706.03762).