OpenDAS / dgl · Commit 10b104dd

authored Nov 30, 2019 by John Andrilla, committed by Minjie Wang on Dec 01, 2019

[Doc] Transformer tutorial, edit for readability (#1025)

Edit pass for grammar and style

parent fdc58a89

Changes: 1 changed file with 78 additions and 80 deletions

tutorials/models/4_old_wines/7_transformer.py (+78, -80)
"""
.. _model-transformer:
-Transformer Tutorial
+Transformer tutorial
====================
**Author**: Zihao Ye, Jinjing Zhou, Qipeng Guo, Quan Gan, Zheng Zhang
"""
################################################################################################
# In this tutorial, you learn about a simplified implementation of the Transformer model.
# You can see highlights of the most important design points. For instance, there is
# only single-head attention. The complete code can be found
# `here <https://github.com/dmlc/dgl/tree/master/examples/pytorch/transformer>`__.
#
# The overall structure is similar to the one from the research paper `Annotated
# Transformer <http://nlp.seas.harvard.edu/2018/04/03/attention.html>`__.
#
# The Transformer model, as a replacement of CNN/RNN architecture for
-# sequence modeling, was introduced in Google’s paper: `Attention is All
+# sequence modeling, was introduced in the research paper: `Attention is All
# You Need <https://arxiv.org/pdf/1706.03762.pdf>`__. It improved the
# state of the art for machine translation as well as natural language
# inference tasks
# (`GPT <https://s3-us-west-2.amazonaws.com/openai-assets/research-covers/language-unsupervised/language_understanding_paper.pdf>`__).
# Recent work on pre-training Transformer with large scale corpus
# (`BERT <https://arxiv.org/pdf/1810.04805.pdf>`__) supports that it is
-# capable of learning high quality semantic representation.
+# capable of learning high-quality semantic representation.
#
# The interesting part of Transformer is its extensive employment of
# attention. The classic use of attention comes from machine translation
...
...
@@ -27,10 +35,10 @@ Transformer Tutorial
# different from RNN-based model, where words (in the source sentence) are
# combined along the chain, which is thought to be too constrained.
#
-# Attention Layer of Transformer
+# Attention layer of Transformer
# ------------------------------
#
-# In Attention Layer of Transformer, for each node the module learns to
+# In the attention layer of Transformer, for each node the module learns to
# assign weights on its in-coming edges. For node pair :math:`(i, j)`
# (from :math:`i` to :math:`j`) with node
# :math:`x_i, x_j \in \mathbb{R}^n`, the score of their connection is
...
...
@@ -56,7 +64,7 @@ Transformer Tutorial
#
# The score is then used to compute the sum of the incoming values,
# normalized over the weights of edges, stored in :math:`\textrm{wv}`.
-# Then we apply an affine layer to :math:`\textrm{wv}` to get the output
+# Then apply an affine layer to :math:`\textrm{wv}` to get the output
# :math:`o`:
#
# .. math::
...
...
@@ -66,11 +74,11 @@ Transformer Tutorial
# \textrm{wv}_i = \sum_{(k, i)\in E} w_{ki} v_k \\
# o = W_o\cdot \textrm{wv} \\
#
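The formulas above can be cross-checked with a dense, fully connected single-head sketch in plain PyTorch. This is illustrative only; the tutorial's code computes the same quantities edge by edge with message passing, and the scaling by the key dimension is the usual Transformer convention assumed here.

.. code:: python

    import torch
    import torch.nn as nn

    # Dense stand-in for the per-edge formulas: score -> normalized weighted sum (wv) -> affine output (o).
    num_nodes, dim = 4, 8
    x = torch.randn(num_nodes, dim)
    W_q = nn.Linear(dim, dim, bias=False)
    W_k = nn.Linear(dim, dim, bias=False)
    W_v = nn.Linear(dim, dim, bias=False)
    W_o = nn.Linear(dim, dim, bias=False)

    q, k, v = W_q(x), W_k(x), W_v(x)
    score = q @ k.t() / dim ** 0.5      # score[j, i] = q_j . k_i (scaled dot product)
    w = torch.softmax(score, dim=-1)    # normalize over node j's incoming edges
    wv = w @ v                          # wv_j = sum_i w[j, i] * v_i
    o = W_o(wv)                         # one output row per node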
-# Multi-Head Attention Layer
+# Multi-head attention layer
# ~~~~~~~~~~~~~~~~~~~~~~~~~~
#
# In Transformer, attention is *multi-headed*. A head is very much like a
-# channel in a convolutional network. The Multi-Head Attention consist of
+# channel in a convolutional network. The multi-head attention consists of
# multiple attention heads, in which each head refers to a single
# attention module. :math:`\textrm{wv}^{(i)}` for all the heads are
# concatenated and mapped to output :math:`o` with an affine layer:
...
...
@@ -80,8 +88,8 @@ Transformer Tutorial
#
# o = W_o \cdot \textrm{concat}\left(\left[\textrm{wv}^{(0)}, \textrm{wv}^{(1)}, \cdots, \textrm{wv}^{(h)}\right]\right)
#
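A minimal sketch of this concatenate-and-project step, with illustrative shapes and names rather than the tutorial's classes:

.. code:: python

    import torch
    import torch.nn as nn

    # h per-head outputs wv^(i), each of width dim_head, concatenated and mapped back to the model width.
    num_nodes, num_heads, dim_head = 4, 8, 16
    dim_model = num_heads * dim_head
    W_o = nn.Linear(dim_model, dim_model, bias=False)

    wv_per_head = [torch.randn(num_nodes, dim_head) for _ in range(num_heads)]
    o = W_o(torch.cat(wv_per_head, dim=-1))   # shape: (num_nodes, dim_model)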
-# The code below wraps necessary components for Multi-Head Attention, and
-# provide two interfaces:
+# The code below wraps necessary components for multi-head attention, and
+# provides two interfaces.
#
# - ``get`` maps state ‘x’ to query, key and value, which is required by
# following steps(\ ``propagate_attention``).
...
...
@@ -117,24 +125,18 @@ Transformer Tutorial
# batch_size = x.shape[0]
# return self.linears[3](x.view(batch_size, -1))
#
-# In this tutorial, we show a simplified version of the implementation in
-# order to highlight the most important design points (for instance we
-# only show single-head attention); the complete code can be found
-# `here <https://github.com/dmlc/dgl/tree/master/examples/pytorch/transformer>`__.
-# The overall structure is similar to the one from `The Annotated
-# Transformer <http://nlp.seas.harvard.edu/2018/04/03/attention.html>`__.
#
-# How DGL Implements Transformer with a Graph Neural Network
+# How DGL implements Transformer with a graph neural network
# ----------------------------------------------------------
#
-# We offer a different perspective of Transformer by treating the
+# You get a different perspective of Transformer by treating the
# attention as edges in a graph and adopting message passing on the edges to
# induce the appropriate processing.
#
-# Graph Structure
+# Graph structure
# ~~~~~~~~~~~~~~~
#
-# We construct the graph by mapping tokens of the source and target
+# Construct the graph by mapping tokens of the source and target
# sentence to nodes. The complete Transformer graph is made up of three
# subgraphs:
#
...
...
@@ -151,17 +153,17 @@ Transformer Tutorial
#
# The full picture looks like this: |image3|
#
-# We pre-build the graphs in dataset preparation stage.
+# Pre-build the graphs in dataset preparation stage.
#
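As a toy illustration of such a token graph, the sketch below builds one small graph with the standard Transformer connectivity patterns (fully connected attention among source tokens, left-to-right attention among target tokens, and full source-to-target attention). The mutable ``DGLGraph`` construction follows the DGL 0.4-era API this tutorial targets; newer DGL versions build graphs from edge lists with ``dgl.graph`` instead.

.. code:: python

    import itertools
    import dgl

    # Nodes 0..2 stand for source tokens, nodes 3..5 for target tokens.
    src_ids, tgt_ids = [0, 1, 2], [3, 4, 5]
    g = dgl.DGLGraph()
    g.add_nodes(len(src_ids) + len(tgt_ids))

    # Encoder self-attention: every source token attends to every source token.
    for u, v in itertools.product(src_ids, src_ids):
        g.add_edge(u, v)
    # Decoder self-attention: a target token attends only to itself and earlier target tokens.
    for j, v in enumerate(tgt_ids):
        for u in tgt_ids[:j + 1]:
            g.add_edge(u, v)
    # Encoder-decoder attention: every target token attends to every source token.
    for u, v in itertools.product(src_ids, tgt_ids):
        g.add_edge(u, v)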
-# Message Passing
+# Message passing
# ~~~~~~~~~~~~~~~
#
-# Once we defined the graph structure, we can move on to defining the
+# Once you define the graph structure, move on to defining the
# computation for message passing.
#
-# Assuming that we have already computed all the queries :math:`q_i`, keys
+# Assuming that you have already computed all the queries :math:`q_i`, keys
# :math:`k_i` and values :math:`v_i`. For each node :math:`i` (no matter
-# whether it is a source token or target token), we can decompose the
+# whether it is a source token or target token), you can decompose the
# attention computation into two steps:
#
# 1. **Message computation:** Compute attention score
...
...
@@ -173,10 +175,10 @@ Transformer Tutorial
# 2. **Message aggregation:** Aggregate the values :math:`v_j` from all
# :math:`j` according to the scores :math:`\mathrm{score}_{ij}`.
#
-# Naive Implementation
+# Simple implementation
# ^^^^^^^^^^^^^^^^^^^^
#
-# Message Computation
+# Message computation
# '''''''''''''''''''
#
# Compute ``score`` and send source node’s ``v`` to destination’s mailbox
...
...
@@ -188,7 +190,7 @@ Transformer Tutorial
# .sum(-1, keepdim=True)),
# 'v': edges.src['v']}
#
-# Message Aggregation
+# Message aggregation
# '''''''''''''''''''
#
# Normalize over all in-edges and take the weighted sum to get the output
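The tutorial's own reduce function is in the folded code; purely as a self-contained illustration of these two steps, message and reduce UDFs could look like the following (the function names and the scaling by the key dimension are assumptions, not necessarily the exact code):

.. code:: python

    import torch

    def message_func(edges):
        # score_ij from the source's k and the destination's q, plus the value carried by the source
        d_k = edges.src['k'].shape[-1]
        score = (edges.src['k'] * edges.dst['q']).sum(-1, keepdim=True) / d_k ** 0.5
        return {'score': score, 'v': edges.src['v']}

    def reduce_func(nodes):
        # mailbox tensors are shaped (num_nodes, num_in_edges, ...)
        w = torch.softmax(nodes.mailbox['score'], dim=1)   # normalize over in-edges
        wv = (w * nodes.mailbox['v']).sum(dim=1)           # weighted sum of incoming values
        return {'wv': wv}

    # g.update_all(message_func, reduce_func)  # afterwards, read g.ndata['wv']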
...
...
@@ -215,8 +217,8 @@ Transformer Tutorial
# Speeding up with built-in functions
# ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
#
-# To speed up the message passing process, we utilize DGL’s builtin
-# function, including:
+# To speed up the message passing process, use DGL’s built-in
+# functions, including:
#
# - ``fn.src_mul_edge(src_field, edges_field, out_field)`` multiplies
# source’s attribute and edge’s attribute, and sends the result to the
...
...
@@ -226,10 +228,10 @@ Transformer Tutorial
# - ``fn.sum(edges_field, out_field)`` sums up
# edge’s attribute and sends aggregation to destination node’s mailbox.
#
-# Here we assemble those built-in function into ``propagate_attention``,
-# which is also the main graph operation function in our final
-# implementation. To accelerate, we break the ``softmax`` operation into
-# the following steps. Recall that for each head there are two phases:
+# Here, you assemble those built-in functions into ``propagate_attention``,
+# which is also the main graph operation function in the final
+# implementation. To accelerate it, break the ``softmax`` operation into
+# the following steps. Recall that for each head there are two phases.
#
# 1. Compute attention score by multiplying src node’s ``k`` and dst node’s
# ``q``
...
...
@@ -284,7 +286,7 @@ Transformer Tutorial
# [fn.src_mul_edge('v', 'score', 'v'), fn.copy_edge('score', 'score')],
# [fn.sum('v', 'wv'), fn.sum('score', 'z')])
#
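For reference, the built-in pairs above can be exercised on a toy graph as shown below. ``fn.src_mul_edge`` and ``fn.copy_edge`` are the names used by the DGL version this tutorial targets; newer releases spell them ``fn.u_mul_e`` and ``fn.copy_e``. Here ``score`` is assumed to already hold exponentiated attention scores, so ``wv / z`` gives the softmax-weighted sum:

.. code:: python

    import torch
    import dgl
    import dgl.function as fn

    # Toy graph with two edges feeding node 2.
    g = dgl.DGLGraph()
    g.add_nodes(3)
    g.add_edges([0, 1], [2, 2])
    g.ndata['v'] = torch.randn(3, 4)
    g.edata['score'] = torch.rand(2, 1)   # pretend these are exp(score) values

    g.update_all(fn.src_mul_edge('v', 'score', 'v2'), fn.sum('v2', 'wv'))
    g.update_all(fn.copy_edge('score', 's'), fn.sum('s', 'z'))
    wv, z = g.ndata['wv'], g.ndata['z']   # wv / z is the normalized weighted sum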
-# Preprocessing and Postprocessing
+# Preprocessing and postprocessing
# ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
#
# In Transformer, data needs to be pre- and post-processed before and
...
...
@@ -390,12 +392,12 @@ Transformer Tutorial
# .. note::
#
# The sublayer connection part is a little bit different from the
-# original paper. However our implementation is the same as `The Annotated
+# original paper. However, this implementation is the same as `The Annotated
# Transformer <http://nlp.seas.harvard.edu/2018/04/03/attention.html>`__
# and
# `OpenNMT <https://github.com/OpenNMT/OpenNMT-py/blob/cd29c1dbfb35f4a2701ff52a1bf4e5bdcf02802e/onmt/encoders/transformer.py>`__.
#
-# Main class of Transformer Graph
+# Main class of Transformer graph
# -------------------------------
#
# The processing flow of Transformer can be seen as a 2-stage
...
...
@@ -470,32 +472,31 @@ Transformer Tutorial
#
# .. note::
#
-# By calling ``update_graph`` function, we can “DIY our own
-# Transformer” on any subgraphs with nearly the same code. This
+# By calling ``update_graph`` function, you can create your own
+# Transformer on any subgraphs with nearly the same code. This
# flexibility enables us to discover new, sparse structures (c.f. local attention
-# mentioned `here <https://arxiv.org/pdf/1508.04025.pdf>`__). Note in our
-# implementation we does not use mask or padding, which makes the logic
+# mentioned `here <https://arxiv.org/pdf/1508.04025.pdf>`__). Note in this
+# implementation you don't use mask or padding, which makes the logic
# more clear and saves memory. The trade-off is that the implementation is
-# slower; we will improve with future DGL optimizations.
+# slower.
#
# Training
# --------
#
# This tutorial does not cover several other techniques such as Label
# Smoothing and Noam Optimizations mentioned in the original paper. For
-# detailed description about these modules, we recommend you to read `The
+# detailed description about these modules, read `The
# Annotated
# Transformer <http://nlp.seas.harvard.edu/2018/04/03/attention.html>`__
# written by Harvard NLP team.
#
-# Task and the Dataset
+# Task and the dataset
# ~~~~~~~~~~~~~~~~~~~~
#
-# The Transformer is a general framework for a variety of NLP tasks. In
-# this tutorial we only focus on the sequence to sequence learning: it’s a
-# typical case to illustrate how it works.
+# The Transformer is a general framework for a variety of NLP tasks. This tutorial focuses
+# on the sequence to sequence learning: it’s a typical case to illustrate how it works.
#
-# As for the dataset, we provide two toy tasks: copy and sort, together
+# As for the dataset, there are two example tasks: copy and sort, together
# with two real-world translation tasks: multi30k en-de task and wmt14
# en-de task.
#
...
...
@@ -509,18 +510,17 @@ Transformer Tutorial
# (Train/Valid/Test: 4500966/3000/3003)
#
# .. note::
-# We are working on training with wmt14, which requires
-# Multi-GPU support(this would be fixed soon).
+# Training with wmt14 requires multi-GPU support and is not available. Contributions are welcome!
#
-# Graph Building
+# Graph building
# ~~~~~~~~~~~~~~
#
-# **Batching** Just like how we handle Tree-LSTM. We build a graph pool in
+# **Batching** This is similar to the way you handle Tree-LSTM. Build a graph pool in
# advance, including all possible combinations of input lengths and output
-# lengths. Then for each sample in a batch, we call ``dgl.batch`` to batch
+# lengths. Then for each sample in a batch, call ``dgl.batch`` to batch
# graphs of their sizes together into a single large graph.
#
-# We have wrapped the process of creating graph pool and building
+# You can wrap the process of creating graph pool and building
# BatchedGraph in ``dataset.GraphPool`` and
# ``dataset.TranslationDataset``.
#
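In isolation, ``dgl.batch`` looks like this (toy graphs, DGL 0.4-era construction API):

.. code:: python

    import dgl

    # Two small graphs of different sizes merged into one batched graph,
    # so a single round of message passing covers the whole batch.
    g1 = dgl.DGLGraph()
    g1.add_nodes(2)
    g1.add_edges([0], [1])

    g2 = dgl.DGLGraph()
    g2.add_nodes(3)
    g2.add_edges([0, 1], [2, 2])

    bg = dgl.batch([g1, g2])
    print(bg.number_of_nodes(), bg.number_of_edges())   # 5 3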
...
...
@@ -572,13 +572,12 @@ Transformer Tutorial
# Put it all together
# -------------------
#
-# We train a one-head transformer with one layer, 128 dimension on copy
-# task. Other parameters are set to default.
+# Train a one-head transformer with one layer, 128 dimension on copy
+# task. Set other parameters to the default.
#
-# Note that we do not involve inference module in this tutorial (which
-# requires beam search), please refer to the `Github
-# Repo <https://github.com/dmlc/dgl/tree/master/examples/pytorch/transformer>`__
-# for full implementation.
+# Inference module is not included in this tutorial. It
+# requires beam search. For a full implementation, see the `GitHub
+# repo <https://github.com/dmlc/dgl/tree/master/examples/pytorch/transformer>`__.
#
# .. code:: python
#
...
...
@@ -638,8 +637,8 @@ Transformer Tutorial
# Visualization
# -------------
#
-# After training, we can visualize the attention our Transformer generates
-# on copy task:
+# After training, you can visualize the attention that the Transformer generates
+# on copy task.
#
# .. code:: python
#
...
...
@@ -648,11 +647,11 @@ Transformer Tutorial
# # visualize head 0 of encoder-decoder attention
# att_animation(att_maps, 'e2d', src_seq, tgt_seq, 0)
#
-# |image5| from the figure we see the decoder nodes gradually learns to
+# |image5| From the figure, you see the decoder nodes gradually learn to
# attend to corresponding nodes in input sequence, which is the expected
# behavior.
#
-# Multi-Head Attention
+# Multi-head attention
# ~~~~~~~~~~~~~~~~~~~~
#
# Besides the attention of a one-head attention trained on toy task. We
...
...
@@ -660,9 +659,8 @@ Transformer Tutorial
# Decoder’s Self Attention and the Encoder-Decoder attention of a
# one-Layer Transformer network trained on multi-30k dataset.
#
-# From the visualization we observe the diversity of different heads
-# (which is what we expected: different heads learn different relations
-# between word pairs):
+# From the visualization you see the diversity of different heads, which is what you would
+# expect. Different heads learn different relations between word pairs.
#
# - **Encoder Self-Attention** |image6|
#
...
...
@@ -677,8 +675,8 @@ Transformer Tutorial
# Adaptive Universal Transformer
# ------------------------------
#
-# A recent paper by Google: `Universal
-# Transformer <https://arxiv.org/pdf/1807.03819.pdf>`__ is an example to
+# A recent research paper by Google, `Universal
+# Transformer <https://arxiv.org/pdf/1807.03819.pdf>`__, is an example to
# show how ``update_graph`` adapts to more complex updating rules.
#
# The Universal Transformer was proposed to address the problem that
...
...
@@ -696,10 +694,10 @@ Transformer Tutorial
# (ACT) <https://arxiv.org/pdf/1603.08983.pdf>`__ mechanism to allow the
# model to dynamically adjust the number of times the representation of
# each position in a sequence is revised (referred to as **step**
-# hereinafter). This model is also known as the Adaptive Universal
-# Transformer (referred to as AUT hereinafter).
+# hereafter). This model is also known as the Adaptive Universal
+# Transformer (AUT).
#
-# In AUT, we maintain an “active nodes” list. In each step :math:`t`, we
+# In AUT, you maintain an active nodes list. In each step :math:`t`, we
# compute a halting probability: :math:`h (0<h<1)` for all nodes in this
# list by:
#
...
...
@@ -718,7 +716,7 @@ Transformer Tutorial
#
# .. math:: s_i = \sum_{t=1}^{T} h_i^t\cdot x_i^t
#
-# In DGL, the algorithm is easy to implement, we just need to call
+# In DGL, implement an algorithm by calling
# ``update_graph`` on nodes that are still active and edges associated
# with these nodes. The following code shows the Universal Transformer
# class in DGL:
...
...
@@ -839,7 +837,7 @@ Transformer Tutorial
#
# return self.generator(g.ndata['x'][nids['dec']]), act_loss * self.act_loss_weight
#
-# Here we call ``filter_nodes`` and ``filter_edge`` to find nodes/edges
+# Call ``filter_nodes`` and ``filter_edge`` to find nodes/edges
# that are still active:
#
# .. note::
...
...
@@ -851,15 +849,15 @@ Transformer Tutorial
# and an edge ID list/tensor as input, then returns a tensor of edge IDs
# that satisfy the given predicate.
#
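A minimal illustration of that predicate style on a toy graph (the ``halt`` field name is made up for this example; the actual implementation uses its own node and edge fields):

.. code:: python

    import torch
    import dgl

    g = dgl.DGLGraph()
    g.add_nodes(4)
    g.add_edges([0, 1, 2], [1, 2, 3])
    g.ndata['halt'] = torch.tensor([0, 0, 1, 0])

    # Nodes that have not halted yet, and edges whose endpoints are both still active.
    active_nodes = g.filter_nodes(lambda nodes: nodes.data['halt'] == 0)
    active_edges = g.filter_edges(lambda edges: (edges.src['halt'] == 0) & (edges.dst['halt'] == 0))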
-# for the full implementation, please refer to our `Github
-# Repo <https://github.com/dmlc/dgl/tree/master/examples/pytorch/transformer/modules/act.py>`__.
+# For the full implementation, see the `GitHub
+# repo <https://github.com/dmlc/dgl/tree/master/examples/pytorch/transformer/modules/act.py>`__.
#
# The figure below shows the effect of Adaptive Computational
-# Time (different positions of a sentence were revised different times):
+# Time. Different positions of a sentence were revised different times.
#
# |image9|
#
-# We also visualize the dynamics of step distribution on nodes during the
+# You can also visualize the dynamics of step distribution on nodes during the
# training of AUT on the sort task (reaching 99.7% accuracy), which demonstrates
# how AUT learns to reduce recurrence steps during training. |image10|
#
...
...
@@ -876,7 +874,7 @@ Transformer Tutorial
# .. |image10| image:: https://s1.ax1x.com/2018/12/06/F1r8Cq.gif
#
# .. note::
-# We apologize that this notebook itself is not runnable due to many dependencies,
-# please download the `7_transformer.py <https://s3.us-east-2.amazonaws.com/dgl.ai/tutorial/7_transformer.py>`__,
+# The notebook itself is not executable due to many dependencies.
+# Download `7_transformer.py <https://s3.us-east-2.amazonaws.com/dgl.ai/tutorial/7_transformer.py>`__,
# and copy the Python script to the directory ``examples/pytorch/transformer``,
# then run ``python 7_transformer.py`` to see how it works.