Commit 16cc5287 authored by VoVAllen, committed by Minjie Wang

[Doc] Improve Capsule with Jinyang & Fix wrong tutorial level layout (#236)

* improve capsule tutorial with jinyang

* fix wrong layout of second-level tutorial

* delete transformer
parent dafe4671
......@@ -195,8 +195,8 @@ intersphinx_mapping = {
# sphinx gallery configurations
from sphinx_gallery.sorting import FileNameSortKey
examples_dirs = ['../../tutorials'] # path to find sources
gallery_dirs = ['tutorials'] # path to generate docs
examples_dirs = ['../../tutorials/basics','../../tutorials/models'] # path to find sources
gallery_dirs = ['tutorials/basics','tutorials/models'] # path to generate docs
reference_url = {
'dgl' : None,
'numpy': 'http://docs.scipy.org/doc/numpy/',
......
......@@ -65,7 +65,8 @@ credit, see `here <https://www.dgl.ai/ack>`_.
:caption: Tutorials
:glob:
tutorials/index
tutorials/basics/index
tutorials/models/index
.. toctree::
:maxdepth: 2
......
......@@ -156,7 +156,7 @@ def pagerank_level2(g):
###############################################################################
# Besides ``update_all``, we also have ``pull``, ``push``, and ``send_and_recv``
# in this level-2 category. Please refer to the :doc:`API reference <../api/python/graph>`
# in this level-2 category. Please refer to the :doc:`API reference <../../api/python/graph>`
# for more details.
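###############################################################################
# As a rough sketch (the message and reduce functions here are illustrative
# names, not ones defined in this excerpt), the same one-round update can be
# triggered at different granularities:
#
# .. code:: python
#
#    g.update_all(message_func, reduce_func)                 # all nodes
#    g.pull([0, 1, 2], message_func, reduce_func)            # only nodes 0, 1, 2
#    g.send_and_recv(g.edges(), message_func, reduce_func)   # an explicit edge set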
......@@ -200,7 +200,7 @@ def pagerank_builtin(g):
#
# `This section <spmv_>`_ describes why spMV can speed up the scatter-gather
# phase in PageRank. For more details about the builtin functions in DGL,
# please read the :doc:`API reference <../api/python/function>`.
# please read the :doc:`API reference <../../api/python/function>`.
#
# You can also download and run the code to see the difference for yourself.
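###############################################################################
# The kind of builtin-based update described above boils down to a pattern
# like the following sketch (the node field ``'pv'`` and the helper name are
# assumptions for illustration). Because both the message and reduce functions
# are builtins, DGL can fuse them into one sparse matrix-vector multiplication
# instead of materializing every message:
import dgl.function as fn

def aggregate_with_builtins(g):
    # copy each source node's 'pv' value onto its out-edges as message 'm',
    # then sum incoming messages into each destination node's 'm_sum' field
    g.update_all(message_func=fn.copy_src(src='pv', out='m'),
                 reduce_func=fn.sum(msg='m', out='m_sum'))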
......@@ -241,5 +241,5 @@ print(g.ndata['pv'])
###############################################################################
# Next steps
# ----------
# Check out :doc:`GCN <models/1_gcn>` and :doc:`Capsule <models/2_capsule>`
# Check out :doc:`GCN <../models/1_gnn/1_gcn>` and :doc:`Capsule <../models/4_old_wines/2_capsule>`
# for more model implementations in DGL.
Basic Tutorials
===============
These tutorials conver the basics of DGL.
These tutorials cover the basics of DGL.
......@@ -10,7 +10,7 @@ Yu Gai, Quan Gan, Zheng Zhang
This is a gentle introduction to using DGL to implement Graph Convolutional
Networks (Kipf & Welling et al., `Semi-Supervised Classification with Graph
Convolutional Networks <https://arxiv.org/pdf/1609.02907.pdf>`_). We build upon
the :doc:`earlier tutorial <../3_pagerank>` on DGLGraph and demonstrate
the :doc:`earlier tutorial <../../basics/3_pagerank>` on DGLGraph and demonstrate
how DGL combines graphs with deep neural networks to learn structural representations.
"""
......@@ -160,4 +160,4 @@ for epoch in range(30):
# multiplication kernels (such as Kipf's
# `pygcn <https://github.com/tkipf/pygcn>`_ code). The above DGL implementation
# in fact already uses this trick thanks to the use of builtin functions. To
# understand what is under the hood, please read our tutorial on :doc:`PageRank <../3_pagerank>`.
# understand what is under the hood, please read our tutorial on :doc:`PageRank <../../basics/3_pagerank>`.
.. _tutorials1-index:
Graph Neural Network and its variants
----------------------------------------
* **GCN** `[paper] <https://arxiv.org/abs/1609.02907>`__ `[tutorial] <models/1_gcn.html>`__
`[code] <https://github.com/jermainewang/dgl/blob/master/examples/pytorch/gcn/gcn.py>`__:
this is the vanilla GCN. The tutorial covers the basic uses of DGL APIs.
* **GAT** `[paper] <https://arxiv.org/abs/1710.10903>`__
`[code] <https://github.com/jermainewang/dgl/blob/master/examples/pytorch/gat/gat.py>`__:
the key extension of GAT over the vanilla GCN is deploying multi-head attention
over the neighborhood of a node, which greatly enhances the capacity and
expressiveness of the model.
* **R-GCN** `[paper] <https://arxiv.org/abs/1703.06103>`__ `[tutorial] <models/4_rgcn.html>`__
[code (wip)]: the key
difference of R-GCN is that it allows multiple edges between two entities of a
graph, and edges with distinct relationships are encoded differently. This is an
interesting extension of GCN with many applications of its own.
* **LGNN** `[paper] <https://arxiv.org/abs/1705.08415>`__ `[tutorial (wip)]` `[code (wip)]`:
this model focuses on community detection by inspecting graph structures. It
uses representations of both the original graph and its line-graph companion. In
addition to demonstrating how an algorithm can harness multiple graphs, our
implementation shows how one can judiciously mix vanilla tensor operations,
sparse-matrix tensor operations, and message passing with DGL.
* **SSE** `[paper] <http://proceedings.mlr.press/v80/dai18a/dai18a.pdf>`__ `[tutorial (wip)]`
`[code] <https://github.com/jermainewang/dgl/blob/master/examples/mxnet/sse/sse_batch.py>`__:
the emphasis here is on *giant* graphs that cannot fit comfortably on one GPU
card. SSE is an example that illustrates the co-design of algorithm and
system: sampling to guarantee asymptotic convergence while lowering the
complexity, and batching across samples for maximum parallelism.
\ No newline at end of file
.. _tutorials2-index:
Dealing with many small graphs
------------------------------
* **Tree-LSTM** `[paper] <https://arxiv.org/abs/1503.00075>`__ `[tutorial] <models/3_tree-lstm.html>`__
`[code] <https://github.com/jermainewang/dgl/blob/master/examples/pytorch/tree_lstm/tree_lstm.py>`__:
sentences in natural language have inherent structures, which are thrown away
by treating them simply as sequences. Tree-LSTM is a powerful model that learns
the representation by leveraging prior syntactic structures (e.g., parse trees).
The challenge in training it well is that simply padding sentences to the
maximum length no longer works, since trees of different sentences have
different sizes and topologies. DGL solves this problem by throwing the trees
into one bigger "container" graph and using message passing to exploit maximum
parallelism. The key API we use is batching.
.. _tutorials3-index:
Generative models
------------------------------
* **DGMG** `[paper] <https://arxiv.org/abs/1803.03324>`__ `[tutorial] <models/5_dgmg.html>`__
`[code] <https://github.com/jermainewang/dgl/tree/master/examples/pytorch/dgmg>`__:
this model belongs to the important family that deals with structural
generation. DGMG is interesting because its state-machine approach is the most
general. It is also very challenging because, unlike Tree-LSTM, every sample
has a dynamic, probability-driven structure that is not available before
training. We are able to progressively leverage intra- and inter-graph
parallelism to steadily improve the performance.
* **JTNN** `[paper] <https://arxiv.org/abs/1802.04364>`__ `[code (wip)]`: unlike DGMG, this
paper generates molecular graphs using the framework of variational
auto-encoder. Perhaps more interesting is its approach of building the structure
hierarchically, in the case of molecules, with a junction tree as the intermediate
scaffolding.
......@@ -4,62 +4,62 @@
Capsule Network Tutorial
===========================
**Author**: Jinjing Zhou, `Jake
Zhao <https://cs.nyu.edu/~jakezhao/>`_, Zheng Zhang
**Author**: Jinjing Zhou, `Jake Zhao <https://cs.nyu.edu/~jakezhao/>`_, Zheng Zhang, Jinyang Li
It is perhaps a little surprising that some of the more classical models can
also be described in terms of graphs, offering a different perspective.
This tutorial describes how this is done for the `capsule network <http://arxiv.org/abs/1710.09829>`__.
It is perhaps a little surprising that some of the more classical models
can also be described in terms of graphs, offering a different
perspective. This tutorial describes how this can be done for the
`capsule network <http://arxiv.org/abs/1710.09829>`__.
"""
#######################################################################################
# Key ideas of Capsule
# --------------------
#
# There are two key ideas that the Capsule model offers.
# The Capsule model offers two key ideas.
#
# **Richer representations** In classic convolutional network, a scalar
# value represents the activation of a given feature. Instead, a capsule
# outputs a vector, whose norm represents the probability of a feature,
# and the orientation its properties.
# **Richer representation** In classic convolutional networks, a scalar
# value represents the activation of a given feature. By contrast, a
# capsule outputs a vector. The vector's length represents the probability
# of a feature being present. The vector's orientation represents the
# various properties of the feature (such as pose, deformation, texture
# etc.).
#
# .. figure:: https://i.imgur.com/55Ovkdh.png
# :alt:
# |image0|
#
# **Dynamic routing** To generalize max-pooling, there is another
# interesting proposed by the authors, as a representational more powerful
# way to construct higher level feature from its low levels. Consider a
# capsule :math:`u_i`. The way :math:`u_i` is integrated to the next level
# capsules take two steps:
# **Dynamic routing** The output of a capsule is preferentially sent to
# certain parents in the layer above based on how well the capsule's
# prediction agrees with that of a parent. Such dynamic
# "routing-by-agreement" generalizes the static routing of max-pooling.
#
# 1. :math:`u_i` projects differently to different higher level capsules
# via a linear transformation: :math:`\hat{u}_{j|i} = W_{ij}u_i`.
# 2. :math:`\hat{u}_{j|i}` routes to the higher level capsules by
# spreading itself with a weighted sum, and the weight is dynamically
# determined by iteratively modify the and checking against the
# "consistency" between :math:`\hat{u}_{j|i}` and :math:`v_j`, for any
# :math:`v_j`. Note that this is similar to a k-means algorithm or
# `competive
# learning <https://en.wikipedia.org/wiki/Competitive_learning>`__ in
# spirit. At the end of iterations, :math:`v_j` now integrates the
# lower level capsules.
# During training, routing is done iteratively; each iteration adjusts
# "routing weights" between capsules based on their observed agreements,
# in a manner similar to a k-means algorithm or `competitive
# learning <https://en.wikipedia.org/wiki/Competitive_learning>`__.
#
# The full algorithm is the following: |image0|
#
# The dynamic routing step can be naturally expressed as a graph
# algorithm. This is the focus of this tutorial. Our implementation is
# adapted from `Cedric
# In this tutorial, we show how capsule's dynamic routing algorithm can be
# naturally expressed as a graph algorithm. Our implementation is adapted
# from `Cedric
# Chee <https://github.com/cedrickchee/capsule-net-pytorch>`__, replacing
# only the routing layer, and achieving similar speed and accuracy.
# only the routing layer. Our version achieves similar speed and accuracy.
#
# Model Implementation
# -----------------------------------
# Step 1: Setup and Graph Initialiation
# ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
# ----------------------
# Step 1: Setup and Graph Initialization
# ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
#
# The connectivity between two layers of capsules forms a directed,
# bipartite graph, as shown in the Figure below.
#
# |image1|
#
# The below figure shows the directed bipartitie graph built for capsules
# network. We denote :math:`b_{ij}`, :math:`\hat{u}_{j|i}` as edge
# features and :math:`v_j` as node features. |image1|
# Each node :math:`j` is associated with feature :math:`v_j`,
# representing its capsule’s output. Each edge is associated with
# features :math:`b_{ij}` and :math:`\hat{u}_{j|i}`. :math:`b_{ij}`
# determines routing weights, and :math:`\hat{u}_{j|i}` represents the
# prediction of capsule :math:`i` for :math:`j`.
#
# Here's how we set up the graph and initialize node and edge features.
import torch.nn as nn
import torch as th
import torch.nn.functional as F
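#########################################################################################
# A minimal sketch of such an initialization (the feature names and shapes here
# are assumptions for illustration; the actual ``init_graph`` below may differ):
import dgl

def init_graph_sketch(in_nodes, out_nodes, f_size):
    g = dgl.DGLGraph()
    g.add_nodes(in_nodes + out_nodes)
    # fully connect every lower-level capsule i to every higher-level capsule j
    u = [i for i in range(in_nodes) for _ in range(out_nodes)]
    v = [in_nodes + j for _ in range(in_nodes) for j in range(out_nodes)]
    g.add_edges(u, v)
    # node feature: capsule output v_j; edge features: routing logit b_ij and
    # prediction u_hat_{j|i}
    g.ndata['v'] = th.zeros(in_nodes + out_nodes, f_size)
    g.edata['b'] = th.zeros(in_nodes * out_nodes, 1)
    g.edata['u_hat'] = th.zeros(in_nodes * out_nodes, f_size)
    return g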
......@@ -88,32 +88,33 @@ def init_graph(in_nodes, out_nodes, f_size):
#########################################################################################
# Step 2: Define message passing functions
# ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
# Recall the following steps, and they are implemented in the class
# ``DGLRoutingLayer`` as the followings:
#
# 1. Normalize over out edges
# This is the pseudocode for the capsule routing algorithm as given in the
# paper:
#
# |image2|
# We implement lines 4-7 of the pseudocode in the class ``DGLRoutingLayer`` as the following steps:
#
# 1. Calculate coupling coefficients:
#
# - Softmax over all out-edge of in-capsules
# :math:`\textbf{c}_i = \text{softmax}(\textbf{b}_i)`.
# - Coefficients are the softmax over all out-edges of each in-capsule:
# :math:`\textbf{c}_{i,j} = \text{softmax}(\textbf{b}_{i,j})`, where the softmax is taken over :math:`j`.
#
# 2. Weighted sum over all in-capsules
# 2. Calculate weighted sum over all in-capsules:
#
# - Out-capsules equals weighted sum of in-capsules
# - The output of a capsule is the weighted sum of the predictions from its in-capsules:
# :math:`s_j=\sum_i c_{ij}\hat{u}_{j|i}`
#
# 3. Squash Operation
# 3. Squash outputs:
#
# - Squashing function is to ensure that short capsule vectors get
# shrunk to almost zero length while the long capsule vectors get
# shrunk to a length slightly below 1. Its norm is expected to
# represents probabilities at some levels.
# - Squash the length of a capsule's output vector into the range (0, 1), so that it can represent the probability of some feature being present.
# - :math:`v_j=\text{squash}(s_j)=\frac{||s_j||^2}{1+||s_j||^2}\frac{s_j}{||s_j||}`
#
# 4. Update weights by agreement
# 4. Update weights by the amount of agreement:
#
# - :math:`\hat{u}_{j|i}\cdot v_j` can be considered as agreement
# between current capsule and updated capsule,
# - The scalar product :math:`\hat{u}_{j|i}\cdot v_j` measures how well capsule :math:`i` agrees with capsule :math:`j`. It is used to update
# :math:`b_{ij}=b_{ij}+\hat{u}_{j|i}\cdot v_j`
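#########################################################################################
# As a small, self-contained illustration of step 3 (separate from the
# ``DGLRoutingLayer`` class below), the squashing function can be written as:
def squash(s, dim=-1):
    # ||s||^2 / (1 + ||s||^2) scales the length into (0, 1); s / ||s|| keeps the direction
    sq_norm = (s ** 2).sum(dim=dim, keepdim=True)
    return (sq_norm / (1.0 + sq_norm)) * s / th.sqrt(sq_norm + 1e-9)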
class DGLRoutingLayer(nn.Module):
def __init__(self, in_nodes, out_nodes, f_size):
super(DGLRoutingLayer, self).__init__()
......@@ -172,9 +173,9 @@ u_hat = th.randn(in_nodes * out_nodes, f_size)
routing = DGLRoutingLayer(in_nodes, out_nodes, f_size)
############################################################################################################
# We can visualize the behavior by monitoring the entropy of outgoing
# weights, they should start high and then drop, as the assignment
# gradually concentrate:
# We can visualize a capsule network's behavior by monitoring the entropy
# of coupling coefficients. They should start high and then drop, as the
# weights gradually concentrate on fewer edges:
entropy_list = []
dist_list = []
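############################################################################################################
# As a sketch of the quantity being tracked (``b`` here stands for the
# ``(in_nodes, out_nodes)`` matrix of routing logits, an assumed name for
# illustration):
#
# .. code:: python
#
#    c = F.softmax(b, dim=1)                  # coupling coefficients per in-capsule
#    entropy = (-c * th.log(c)).sum(dim=1)    # one entropy value per in-capsule
#
# High entropy means the routing weights are spread evenly; low entropy means
# they have concentrated on a few out-edges.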
......@@ -193,9 +194,9 @@ plt.xlabel("Number of Routing")
plt.xticks(np.arange(len(entropy_list)))
plt.close()
############################################################################################################
# |image3|
#
# .. figure:: https://i.imgur.com/dMvu7p3.png
# :alt:
# Alternatively, we can also watch the evolution of histograms:
import seaborn as sns
import matplotlib.animation as animation
......@@ -216,8 +217,10 @@ ani = animation.FuncAnimation(fig, dist_animate, frames=len(entropy_list), inter
plt.close()
############################################################################################################
# Alternatively, we can also watch the evolution of histograms: |image2|
# Or monitor the how lower level capcules gradually attach to one of the higher level ones:
# |image4|
#
# Or monitor how lower-level capsules gradually attach to one of the
# higher-level ones:
import networkx as nx
from networkx.algorithms import bipartite
......@@ -251,14 +254,16 @@ ani2 = animation.FuncAnimation(fig2, weight_animate, frames=len(dist_list), inte
plt.close()
############################################################################################################
# |image3|
# |image5|
#
# The full code of this visulization is provided at
# The full code of this visualization is provided at
# `link <https://github.com/jermainewang/dgl/blob/master/examples/pytorch/capsule/simple_routing.py>`__; the complete
# code that trains on MNIST is at `link <https://github.com/jermainewang/dgl/tree/tutorial/examples/pytorch/capsule>`__.
#
# .. |image0| image:: https://i.imgur.com/mv1W9Rv.png
# .. |image0| image:: https://i.imgur.com/55Ovkdh.png
# .. |image1| image:: https://i.imgur.com/9tc6GLl.png
# .. |image2| image:: https://github.com/VoVAllen/DGL_Capsule/raw/master/routing_dist.gif
# .. |image3| image:: https://github.com/VoVAllen/DGL_Capsule/raw/master/routing_vis.gif
#
# .. |image2| image:: https://i.imgur.com/mv1W9Rv.png
# .. |image3| image:: https://i.imgur.com/dMvu7p3.png
# .. |image4| image:: https://github.com/VoVAllen/DGL_Capsule/raw/master/routing_dist.gif
# .. |image5| image:: https://github.com/VoVAllen/DGL_Capsule/raw/master/routing_vis.gif
.. _tutorials4-index:
Old (new) wines in new bottle
-----------------------------
* **Capsule** `[paper] <https://arxiv.org/abs/1710.09829>`__ `[tutorial] <models/2_capsule.html>`__
`[code] <https://github.com/jermainewang/dgl/tree/master/examples/pytorch/capsule>`__: this new
computer vision model has two key ideas -- enhancing the feature representation
in a vector form (instead of a scalar) called *capsule*, and replacing
maxpooling with dynamic routing. The idea of dynamic routing is to integrate a
lower-level capsule into one (or several) of the higher-level ones with
non-parametric message passing. We show how the latter can be nicely implemented
with DGL APIs.
* **Transformer** `[paper] <https://arxiv.org/abs/1706.03762>`__ `[tutorial (wip)]` `[code (wip)]` and
**Universal Transformer** `[paper] <https://arxiv.org/abs/1807.03819>`__ `[tutorial (wip)]`
`[code (wip)]`: these
two models replace RNN with several layers of multi-head attention to encode
and discover structures among tokens of a sentence. These attention mechanisms
can be similarly formulated as graph operations with message passing.
......@@ -15,85 +15,4 @@ We categorize the models below, providing links to the original code and
tutorial when appropriate. As will become apparent, these models stress the use
of different DGL APIs.
Graph Neural Network and its variant
------------------------------------
* **GCN** `[paper] <https://arxiv.org/abs/1609.02907>`__ `[tutorial] <models/1_gcn.html>`__
`[code] <https://github.com/jermainewang/dgl/blob/master/examples/pytorch/gcn/gcn.py>`__:
this is the vanilla GCN. The tutorial covers the basic uses of DGL APIs.
* **GAT** `[paper] <https://arxiv.org/abs/1710.10903>`__
`[code] <https://github.com/jermainewang/dgl/blob/master/examples/pytorch/gat/gat.py>`__:
the key extension of GAT over the vanilla GCN is deploying multi-head attention
over the neighborhood of a node, which greatly enhances the capacity and
expressiveness of the model.
* **R-GCN** `[paper] <https://arxiv.org/abs/1703.06103>`__ `[tutorial] <models/4_rgcn.html>`__
[code (wip)]: the key
difference of R-GCN is that it allows multiple edges between two entities of a
graph, and edges with distinct relationships are encoded differently. This is an
interesting extension of GCN with many applications of its own.
* **LGNN** `[paper] <https://arxiv.org/abs/1705.08415>`__ `[tutorial (wip)]` `[code (wip)]`:
this model focuses on community detection by inspecting graph structures. It
uses representations of both the original graph and its line-graph companion. In
addition to demonstrating how an algorithm can harness multiple graphs, our
implementation shows how one can judiciously mix vanilla tensor operations,
sparse-matrix tensor operations, and message passing with DGL.
* **SSE** `[paper] <http://proceedings.mlr.press/v80/dai18a/dai18a.pdf>`__ `[tutorial (wip)]`
`[code] <https://github.com/jermainewang/dgl/blob/master/examples/mxnet/sse/sse_batch.py>`__:
the emphasis here is on *giant* graphs that cannot fit comfortably on one GPU
card. SSE is an example that illustrates the co-design of algorithm and
system: sampling to guarantee asymptotic convergence while lowering the
complexity, and batching across samples for maximum parallelism.
Dealing with many small graphs
------------------------------
* **Tree-LSTM** `[paper] <https://arxiv.org/abs/1503.00075>`__ `[tutorial] <models/3_tree-lstm.html>`__
`[code] <https://github.com/jermainewang/dgl/blob/master/examples/pytorch/tree_lstm/tree_lstm.py>`__:
sentences in natural language have inherent structures, which are thrown away
by treating them simply as sequences. Tree-LSTM is a powerful model that learns
the representation by leveraging prior syntactic structures (e.g., parse trees).
The challenge in training it well is that simply padding sentences to the
maximum length no longer works, since trees of different sentences have
different sizes and topologies. DGL solves this problem by throwing the trees
into one bigger "container" graph and using message passing to exploit maximum
parallelism. The key API we use is batching.
Generative models
------------------------------
* **DGMG** `[paper] <https://arxiv.org/abs/1803.03324>`__ `[tutorial] <models/5_dgmg.html>`__
`[code] <https://github.com/jermainewang/dgl/tree/master/examples/pytorch/dgmg>`__:
this model belongs to the important family that deals with structural
generation. DGMG is interesting because its state-machine approach is the most
general. It is also very challenging because, unlike Tree-LSTM, every sample
has a dynamic, probability-driven structure that is not available before
training. We are able to progressively leverage intra- and inter-graph
parallelism to steadily improve the performance.
* **JTNN** `[paper] <https://arxiv.org/abs/1802.04364>`__ `[code (wip)]`: unlike DGMG, this
paper generates molecular graphs using the framework of variational
auto-encoder. Perhaps more interesting is its approach of building the structure
hierarchically, in the case of molecules, with a junction tree as the intermediate
scaffolding.
Old (new) wines in new bottle
-----------------------------
* **Capsule** `[paper] <https://arxiv.org/abs/1710.09829>`__ `[tutorial] <models/2_capsule.html>`__
`[code] <https://github.com/jermainewang/dgl/tree/master/examples/pytorch/capsule>`__: this new
computer vision model has two key ideas -- enhancing the feature representation
in a vector form (instead of a scalar) called *capsule*, and replacing
maxpooling with dynamic routing. The idea of dynamic routing is to integrate a
lower-level capsule into one (or several) of the higher-level ones with
non-parametric message passing. We show how the latter can be nicely implemented
with DGL APIs.
* **Transformer** `[paper] <https://arxiv.org/abs/1706.03762>`__ `[tutorial (wip)]` `[code (wip)]` and
**Universal Transformer** `[paper] <https://arxiv.org/abs/1807.03819>`__ `[tutorial (wip)]`
`[code (wip)]`: these
two models replace RNN with several layers of multi-head attention to encode
and discover structures among tokens of a sentence. These attention mechanisms
can be similarly formulated as graph operations with message passing.