"git@developer.sourcefind.cn:OpenDAS/ollama.git" did not exist on "4dcf5c3e0bfd95e1fd25cbc0ca2c2544d2ba4e0c"
Commit 16cc5287 authored by VoVAllen, committed by Minjie Wang

[Doc] Improve Capsule with Jinyang & Fix wrong tutorial level layout (#236)

* improve capsule tutorial with jinyang

* fix wrong layout of second-level tutorial

* delete transformer
parent dafe4671
@@ -195,8 +195,8 @@ intersphinx_mapping = {
# sphinx gallery configurations
from sphinx_gallery.sorting import FileNameSortKey
examples_dirs = ['../../tutorials/basics','../../tutorials/models']   # path to find sources
gallery_dirs = ['tutorials/basics','tutorials/models']                # path to generate docs
reference_url = {
    'dgl' : None,
    'numpy': 'http://docs.scipy.org/doc/numpy/',
...
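# For reference, sphinx-gallery consumes these two lists through the
# ``sphinx_gallery_conf`` dictionary. A typical wiring (a sketch; the
# repository's actual conf.py may differ) looks like:
sphinx_gallery_conf = {
    'examples_dirs': examples_dirs,              # where the tutorial sources live
    'gallery_dirs': gallery_dirs,                # where rendered pages are written
    'within_subsection_order': FileNameSortKey,  # keep the numbered file order
    'reference_url': reference_url,              # cross-link APIs via intersphinx
}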
@@ -65,7 +65,8 @@ credit, see `here <https://www.dgl.ai/ack>`_.
   :caption: Tutorials
   :glob:

   tutorials/basics/index
   tutorials/models/index

.. toctree::
   :maxdepth: 2
...
@@ -156,7 +156,7 @@ def pagerank_level2(g):
###############################################################################
# Besides ``update_all``, we also have ``pull``, ``push``, and ``send_and_recv``
# in this level-2 category. Please refer to the :doc:`API reference <../../api/python/graph>`
# for more details.
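# As a quick reminder, a level-2 PageRank step with user-defined message and
# reduce functions looks roughly like the sketch below (a sketch, not
# necessarily identical to the tutorial's ``pagerank_level2`` above; it assumes
# the node features ``pv`` and ``deg`` set up earlier in the tutorial):
import torch
import dgl

N = 100          # number of nodes, assumed from the tutorial setup
DAMP = 0.85      # damping factor

def pagerank_message_func(edges):
    # each node sends its current PageRank divided by its out-degree
    return {'pv': edges.src['pv'] / edges.src['deg']}

def pagerank_reduce_func(nodes):
    # each node sums its incoming messages and applies the damping factor
    msgs = torch.sum(nodes.mailbox['pv'], dim=1)
    return {'pv': (1 - DAMP) / N + DAMP * msgs}

def pagerank_one_iteration(g):
    # one full PageRank iteration expressed as a single level-2 call
    g.update_all(pagerank_message_func, pagerank_reduce_func)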
@@ -200,7 +200,7 @@ def pagerank_builtin(g):
#
# `This section <spmv_>`_ describes why spMV can speed up the scatter-gather
# phase in PageRank. For more details about the builtin functions in DGL,
# please read the :doc:`API reference <../../api/python/function>`.
#
# You can also download and run the codes to feel the difference.
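# For comparison, here is a sketch of the builtin-function variant (again
# assuming the ``pv``/``deg`` features and the ``N``/``DAMP`` constants from
# the sketch above); pairing ``fn.copy_src`` with ``fn.sum`` is what DGL can
# fuse into a single sparse matrix-vector multiply:
import dgl.function as fn

def pagerank_spmv_sketch(g):
    g.ndata['pv'] = g.ndata['pv'] / g.ndata['deg']
    g.update_all(message_func=fn.copy_src(src='pv', out='m'),
                 reduce_func=fn.sum(msg='m', out='m_sum'))
    g.ndata['pv'] = (1 - DAMP) / N + DAMP * g.ndata['m_sum']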
@@ -241,5 +241,5 @@ print(g.ndata['pv'])
###############################################################################
# Next steps
# ----------
# Check out :doc:`GCN <../models/1_gnn/1_gcn>` and :doc:`Capsule <../models/4_old_wines/2_capsule>`
# for more model implementations in DGL.
Basic Tutorials
===============

These tutorials cover the basics of DGL.
@@ -10,7 +10,7 @@ Yu Gai, Quan Gan, Zheng Zhang

This is a gentle introduction to using DGL to implement Graph Convolutional
Networks (Kipf & Welling, `Semi-Supervised Classification with Graph
Convolutional Networks <https://arxiv.org/pdf/1609.02907.pdf>`_). We build upon
the :doc:`earlier tutorial <../../basics/3_pagerank>` on DGLGraph and demonstrate
how DGL combines graphs with deep neural networks to learn structural representations.
"""
@@ -160,4 +160,4 @@ for epoch in range(30):
# multiplication kernels (such as Kipf's
# `pygcn <https://github.com/tkipf/pygcn>`_ code). The above DGL implementation
# has in fact already used this trick due to the use of builtin functions. To
# understand what is under the hood, please read our tutorial on :doc:`PageRank <../../basics/3_pagerank>`.
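# To make the connection concrete, the sketch below shows a minimal GCN layer
# built from the same builtin functions (a simplified illustration under the
# API described above, not a copy of the tutorial's exact code):
import torch.nn as nn
import dgl.function as fn

# neighbor aggregation expressed with builtins, so DGL can run it as spMV
gcn_msg = fn.copy_src(src='h', out='m')
gcn_reduce = fn.sum(msg='m', out='h')

class GCNLayer(nn.Module):
    def __init__(self, in_feats, out_feats):
        super(GCNLayer, self).__init__()
        self.linear = nn.Linear(in_feats, out_feats)

    def forward(self, g, feature):
        g.ndata['h'] = feature
        g.update_all(gcn_msg, gcn_reduce)   # gather and sum neighbor features
        h = g.ndata.pop('h')
        return self.linear(h)               # node-wise linear transformation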
.. _tutorials1-index:

Graph neural networks and their variants
-----------------------------------------

* **GCN** `[paper] <https://arxiv.org/abs/1609.02907>`__ `[tutorial] <models/1_gcn.html>`__
  `[code] <https://github.com/jermainewang/dgl/blob/master/examples/pytorch/gcn/gcn.py>`__:
  this is the vanilla GCN. The tutorial covers the basic uses of DGL APIs.

* **GAT** `[paper] <https://arxiv.org/abs/1710.10903>`__
  `[code] <https://github.com/jermainewang/dgl/blob/master/examples/pytorch/gat/gat.py>`__:
  the key extension of GAT over the vanilla GCN is deploying multi-head attention
  over the neighborhood of a node, which greatly enhances the capacity and
  expressiveness of the model.

* **R-GCN** `[paper] <https://arxiv.org/abs/1703.06103>`__ `[tutorial] <models/4_rgcn.html>`__
  [code (wip)]: the key
  difference of R-GCN is that it allows multiple edges between two entities of a graph, and
  edges with distinct relationships are encoded differently. This is an
  interesting extension of GCN that has many applications of its own.

* **LGNN** `[paper] <https://arxiv.org/abs/1705.08415>`__ `[tutorial (wip)]` `[code (wip)]`:
  this model focuses on community detection by inspecting graph structures. It
  uses representations of both the original graph and its line-graph companion. In
  addition to demonstrating how an algorithm can harness multiple graphs, our
  implementation shows how one can judiciously mix vanilla tensor operations,
  sparse-matrix tensor operations, and message passing with DGL.

* **SSE** `[paper] <http://proceedings.mlr.press/v80/dai18a/dai18a.pdf>`__ `[tutorial (wip)]`
  `[code] <https://github.com/jermainewang/dgl/blob/master/examples/mxnet/sse/sse_batch.py>`__:
  the emphasis here is on *giant* graphs that cannot fit comfortably on one GPU
  card. SSE is an example that illustrates the co-design of algorithm and
  system: sampling to guarantee asymptotic convergence while lowering the
  complexity, and batching across samples for maximum parallelism.
.. _tutorials2-index:

Dealing with many small graphs
------------------------------

* **Tree-LSTM** `[paper] <https://arxiv.org/abs/1503.00075>`__ `[tutorial] <models/3_tree-lstm.html>`__
  `[code] <https://github.com/jermainewang/dgl/blob/master/examples/pytorch/tree_lstm/tree_lstm.py>`__:
  sentences of natural languages have inherent structures, which are thrown away
  by treating them simply as sequences. Tree-LSTM is a powerful model that learns
  the representation by leveraging prior syntactic structures (e.g. parse trees).
  The challenge in training it well is that simply padding sentences to the
  maximum length no longer works, since trees of different sentences have
  different sizes and topologies. DGL solves this problem by throwing the trees
  into a bigger "container" graph and using message passing to exploit maximum
  parallelism. The key API we use is batching, sketched below.
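Below is a minimal, self-contained sketch of the batching idea; the toy trees,
feature size, and the helper name are made up for illustration::

   import dgl
   import torch as th

   def make_tree(num_nodes, parents):
       # build a small tree; edges point from child to parent
       g = dgl.DGLGraph()
       g.add_nodes(num_nodes)
       for child, parent in enumerate(parents):
           if parent >= 0:
               g.add_edges(child, parent)
       return g

   t1 = make_tree(3, [-1, 0, 0])          # a root with two children
   t2 = make_tree(5, [-1, 0, 0, 1, 1])    # a deeper, differently shaped tree

   bg = dgl.batch([t1, t2])               # one big "container" graph
   bg.ndata['h'] = th.zeros(bg.number_of_nodes(), 4)
   # message passing on `bg` now runs over all trees in parallel;
   # dgl.unbatch(bg) recovers the individual trees afterwards.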
.. _tutorials3-index:

Generative models
------------------------------

* **DGMG** `[paper] <https://arxiv.org/abs/1803.03324>`__ `[tutorial] <models/5_dgmg.html>`__
  `[code] <https://github.com/jermainewang/dgl/tree/master/examples/pytorch/dgmg>`__:
  this model belongs to the important family that deals with structural
  generation. DGMG is interesting because its state-machine approach is the most
  general. It is also very challenging because, unlike Tree-LSTM, every sample
  has a dynamic, probability-driven structure that is not available before
  training. We are able to progressively leverage intra- and inter-graph
  parallelism to steadily improve the performance.

* **JTNN** `[paper] <https://arxiv.org/abs/1802.04364>`__ `[code (wip)]`: unlike DGMG, this
  paper generates molecular graphs using the framework of a variational
  auto-encoder. Perhaps more interesting is its approach to building structures
  hierarchically, in the case of molecules with a junction tree as the intermediate
  scaffolding.
@@ -4,62 +4,62 @@
Capsule Network Tutorial
===========================

**Author**: Jinjing Zhou, `Jake Zhao <https://cs.nyu.edu/~jakezhao/>`_, Zheng Zhang, Jinyang Li

It is perhaps a little surprising that some of the more classical models
can also be described in terms of graphs, offering a different
perspective. This tutorial describes how this can be done for the
`capsule network <http://arxiv.org/abs/1710.09829>`__.
"""
#######################################################################################
# Key ideas of Capsule
# --------------------
#
# The Capsule model offers two key ideas.
#
# **Richer representation** In classic convolutional networks, a scalar
# value represents the activation of a given feature. By contrast, a
# capsule outputs a vector. The vector's length represents the probability
# of a feature being present. The vector's orientation represents the
# various properties of the feature (such as pose, deformation, texture,
# etc.).
#
# |image0|
#
# **Dynamic routing** The output of a capsule is preferentially sent to
# certain parents in the layer above based on how well the capsule's
# prediction agrees with that of a parent. Such dynamic
# "routing-by-agreement" generalizes the static routing of max-pooling.
#
# During training, routing is done iteratively; each iteration adjusts
# "routing weights" between capsules based on their observed agreements,
# in a manner similar to a k-means algorithm or `competitive
# learning <https://en.wikipedia.org/wiki/Competitive_learning>`__.
#
# In this tutorial, we show how the capsule's dynamic routing algorithm can
# be naturally expressed as a graph algorithm. Our implementation is
# adapted from `Cedric
# Chee <https://github.com/cedrickchee/capsule-net-pytorch>`__, replacing
# only the routing layer. Our version achieves similar speed and accuracy.
#
# Model Implementation
# ----------------------
# Step 1: Setup and Graph Initialization
# ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
#
# The connectivity between two layers of capsules forms a directed,
# bipartite graph, as shown in the figure below.
#
# |image1|
#
# Each node :math:`j` is associated with feature :math:`v_j`,
# representing its capsule's output. Each edge is associated with
# features :math:`b_{ij}` and :math:`\hat{u}_{j|i}`. :math:`b_{ij}`
# determines routing weights, and :math:`\hat{u}_{j|i}` represents the
# prediction of capsule :math:`i` for :math:`j`.
#
# Here's how we set up the graph and initialize node and edge features.
import torch.nn as nn
import torch as th
import torch.nn.functional as F
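# The next hunk only shows the signature of ``init_graph``. As a rough sketch
# of what such a setup can look like (the feature names ``'v'``/``'b'``, the
# edge ordering, and the helper name are assumptions for illustration, using
# the DGL API of that time):
import dgl

def init_graph_sketch(in_nodes, out_nodes, f_size):
    # Build a complete bipartite graph: every in-capsule connects to every
    # out-capsule, and all features start at zero.
    g = dgl.DGLGraph()
    g.add_nodes(in_nodes + out_nodes)
    out_indx = list(range(in_nodes, in_nodes + out_nodes))
    for u in range(in_nodes):
        g.add_edges(u, out_indx)                             # edges grouped by in-capsule
    g.ndata['v'] = th.zeros(in_nodes + out_nodes, f_size)    # capsule outputs v_j
    g.edata['b'] = th.zeros(in_nodes * out_nodes, 1)         # routing logits b_ij
    return g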
@@ -88,32 +88,33 @@ def init_graph(in_nodes, out_nodes, f_size):
#########################################################################################
# Step 2: Define message passing functions
# ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
#
# This is the pseudo code for Capsule's routing algorithm as given in the
# paper:
#
# |image2|
#
# We implement pseudo code lines 4-7 in the class ``DGLRoutingLayer`` as the
# following steps (a dense-tensor sketch of the same loop appears after the
# code below):
#
# 1. Calculate coupling coefficients:
#
#    - Coefficients are the softmax over all out-edges of in-capsules:
#      :math:`\textbf{c}_{i,j} = \text{softmax}(\textbf{b}_{i,j})`.
#
# 2. Calculate weighted sum over all in-capsules:
#
#    - The output of a capsule is equal to the weighted sum of its in-capsules:
#      :math:`s_j=\sum_i c_{ij}\hat{u}_{j|i}`
#
# 3. Squash outputs:
#
#    - Squash the length of a capsule's output vector to the range (0, 1), so it
#      can represent the probability (of some feature being present).
#    - :math:`v_j=\text{squash}(s_j)=\frac{||s_j||^2}{1+||s_j||^2}\frac{s_j}{||s_j||}`
#
# 4. Update weights by the amount of agreement:
#
#    - The scalar product :math:`\hat{u}_{j|i}\cdot v_j` can be considered as how
#      well capsule :math:`i` agrees with :math:`j`. It is used to update
#      :math:`b_{ij}=b_{ij}+\hat{u}_{j|i}\cdot v_j`
class DGLRoutingLayer(nn.Module):
    def __init__(self, in_nodes, out_nodes, f_size):
        super(DGLRoutingLayer, self).__init__()
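# The diff elides the rest of ``DGLRoutingLayer``. As a reference for the four
# steps above, here is a plain dense-tensor sketch of the routing loop; the DGL
# version distributes the same quantities over the graph's edge and node
# features. Shapes, names, and the iteration count here are illustrative
# assumptions, not the tutorial's exact code.
def dense_routing_sketch(u_hat, num_iters=3):
    # u_hat: predictions of shape (in_nodes, out_nodes, f_size)
    in_nodes, out_nodes, f_size = u_hat.shape
    b = th.zeros(in_nodes, out_nodes)                   # routing logits b_ij
    for _ in range(num_iters):
        c = F.softmax(b, dim=1)                         # step 1: coupling coefficients
        s = (c.unsqueeze(-1) * u_hat).sum(dim=0)        # step 2: weighted sum s_j
        sq = (s ** 2).sum(dim=1, keepdim=True)
        v = (sq / (1.0 + sq)) * s / th.sqrt(sq + 1e-9)  # step 3: squash
        b = b + (u_hat * v.unsqueeze(0)).sum(dim=-1)    # step 4: agreement update
    return v                                            # out-capsule outputs v_j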
@@ -172,9 +173,9 @@ u_hat = th.randn(in_nodes * out_nodes, f_size)
routing = DGLRoutingLayer(in_nodes, out_nodes, f_size)
############################################################################################################
# We can visualize a capsule network's behavior by monitoring the entropy
# of coupling coefficients. They should start high and then drop, as the
# weights gradually concentrate on fewer edges:
entropy_list = []
dist_list = []
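# The monitoring loop itself is elided by the diff; a rough sketch of what it
# could look like (assuming the routing layer exposes its graph as
# ``routing.g`` and stores the coupling coefficients in ``edata['c']``, as in
# Step 2):
for i in range(10):
    routing(u_hat)                                       # one routing iteration
    c = routing.g.edata['c'].view(in_nodes, out_nodes)
    entropy = (-c * th.log(c)).sum(dim=1)                # entropy per in-capsule
    entropy_list.append(entropy.data.numpy())
    dist_list.append(c.data.numpy())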
@@ -193,9 +194,9 @@ plt.xlabel("Number of Routing")
plt.xticks(np.arange(len(entropy_list)))
plt.close()
############################################################################################################
# |image3|
#
# Alternatively, we can also watch the evolution of histograms:
import seaborn as sns
import matplotlib.animation as animation
@@ -216,8 +217,10 @@ ani = animation.FuncAnimation(fig, dist_animate, frames=len(entropy_list), inter
plt.close()
############################################################################################################
# |image4|
#
# Or monitor how lower-level capsules gradually attach to one of the
# higher-level ones:
import networkx as nx
from networkx.algorithms import bipartite
@@ -251,14 +254,16 @@ ani2 = animation.FuncAnimation(fig2, weight_animate, frames=len(dist_list), inte
plt.close()
############################################################################################################
# |image5|
#
# The full code of this visualization is provided at
# `link <https://github.com/jermainewang/dgl/blob/master/examples/pytorch/capsule/simple_routing.py>`__; the complete
# code that trains on MNIST is at `link <https://github.com/jermainewang/dgl/tree/tutorial/examples/pytorch/capsule>`__.
#
# .. |image0| image:: https://i.imgur.com/55Ovkdh.png
# .. |image1| image:: https://i.imgur.com/9tc6GLl.png
# .. |image2| image:: https://i.imgur.com/mv1W9Rv.png
# .. |image3| image:: https://i.imgur.com/dMvu7p3.png
# .. |image4| image:: https://github.com/VoVAllen/DGL_Capsule/raw/master/routing_dist.gif
# .. |image5| image:: https://github.com/VoVAllen/DGL_Capsule/raw/master/routing_vis.gif
.. _tutorials4-index:

Old (new) wines in new bottle
-----------------------------

* **Capsule** `[paper] <https://arxiv.org/abs/1710.09829>`__ `[tutorial] <models/2_capsule.html>`__
  `[code] <https://github.com/jermainewang/dgl/tree/master/examples/pytorch/capsule>`__: this new
  computer vision model has two key ideas -- enhancing the feature representation
  in a vector form (instead of a scalar), called a *capsule*, and replacing
  max-pooling with dynamic routing. The idea of dynamic routing is to integrate a
  lower-level capsule into one (or several) of the higher-level ones with
  non-parametric message passing. We show how the latter can be nicely implemented
  with DGL APIs.

* **Transformer** `[paper] <https://arxiv.org/abs/1706.03762>`__ `[tutorial (wip)]` `[code (wip)]` and
  **Universal Transformer** `[paper] <https://arxiv.org/abs/1807.03819>`__ `[tutorial (wip)]`
  `[code (wip)]`: these
  two models replace RNNs with several layers of multi-head attention to encode
  and discover structures among the tokens of a sentence. These attention mechanisms
  can similarly be formulated as graph operations with message passing.
@@ -15,85 +15,4 @@ We categorize the models below, providing links to the original code and
tutorial when appropriate. As will become apparent, these models stress the use
of different DGL APIs.