"git@developer.sourcefind.cn:OpenDAS/torch-sparce.git" did not exist on "ae35b8a596530494a02e5a63becd2e77e29bb384"
Commit 10e18ed9 authored by Minjie Wang, committed by GitHub

[Doc] overview and others (#222)

* model tutorials overview

* update the 1_first

* revised overview and summarized "what is DGL" in glance
parent 2c170a8c
@@ -3,25 +3,74 @@
   You can adapt this file completely to your liking, but it should at least
   contain the root `toctree` directive.

Overview of DGL
===============

Deep Graph Library (DGL) is a Python package built for easy implementation of
the graph neural network model family on top of existing DL frameworks (e.g.
PyTorch, MXNet, Gluon).

DGL reduces the implementation of graph neural networks to declaring a set of
*functions* (or *modules*, in PyTorch terminology). In addition, DGL provides:

* Versatile control over message passing, ranging from low-level operations
  such as sending along selected edges and receiving on specific nodes, to
  high-level control such as graph-wide feature updates (see the short sketch
  after this list).
* Transparent speed optimization with automatic batching of computations and
  sparse matrix multiplication.
* Seamless integration with existing deep learning frameworks.
* Easy and friendly interfaces for node/edge feature access and graph
  structure manipulation.
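
A minimal sketch of these two levels of control, assuming the message-passing
API used in the tutorials below (``send``/``recv`` and ``update_all``); details
may differ slightly across versions:

.. code:: python

   import dgl
   import torch

   g = dgl.DGLGraph()
   g.add_nodes(3)
   g.add_edges([0, 1, 2], [1, 2, 0])    # a directed 3-cycle
   g.ndata['h'] = torch.ones(3, 2)      # a toy node feature

   def message_func(edges):             # what each edge sends
       return {'m': edges.src['h']}

   def reduce_func(nodes):              # how each node aggregates its mailbox
       return {'h': nodes.mailbox['m'].sum(1)}

   # low-level control: send along selected edges, then receive on chosen nodes
   g.send(g.edges(), message_func)
   g.recv(g.nodes(), reduce_func)

   # high-level control: one graph-wide feature update
   g.update_all(message_func, reduce_func)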

To begin with, we have prototyped 10 models across various domains:
semi-supervised learning on graphs (with potentially billions of nodes/edges),
generative models on graphs, (previously) difficult-to-parallelize tree-based
models like TreeLSTM, etc. We also implement some conventional models in DGL
from a new graphical perspective, yielding simplicity.

Relationship of DGL to other frameworks
---------------------------------------

DGL is designed to be compatible with, and agnostic to, existing tensor
frameworks. It provides a backend adapter interface that allows easy porting
to other tensor-based, autograd-enabled frameworks. Currently, our prototype
works with MXNet/Gluon and PyTorch.

Free software
-------------

DGL is free software; you can redistribute it and/or modify it under the terms
of the Apache License 2.0. We welcome contributions.
Join us on `GitHub <https://github.com/jermainewang/dgl>`_.

History
-------

Prototyping of DGL started in early spring 2018 at NYU Shanghai by Prof. Zheng
Zhang and Quan Gan. Serious development began when Minjie, Lingfan and Prof.
Jinyang Li from NYU's system group joined, flanked by a team of student
volunteers at NYU Shanghai, Fudan and other universities (Yu, Zihao, Murphy,
Allen, Qipeng, Qi, Hao), as well as early adopters at the CILVR lab (Jake
Zhao). Development accelerated when the AWS MXNet Science team joined forces,
with Da Zheng, Alex Smola, Haibin Lin, Chao Ma and a number of others. For full
credit, see `here <https://www.dgl.ai/ack>`_.

.. toctree::
   :maxdepth: 1
   :caption: Get Started
   :glob:

   install/index

.. toctree::
   :maxdepth: 2
   :caption: Tutorials
   :glob:

   tutorials/index

.. toctree::
   :maxdepth: 2
   :caption: API Reference
   :glob:

   api/python/index
@@ -7,51 +7,42 @@ DGL at a Glance
**Author**: `Minjie Wang <https://jermainewang.github.io/>`_, Quan Gan, `Jake
Zhao <https://cs.nyu.edu/~jakezhao/>`_, Zheng Zhang

DGL is a Python package dedicated to deep learning on graphs, built atop
existing tensor DL frameworks (e.g. PyTorch, MXNet) and simplifying the
implementation of graph-based neural networks.

The goal of this tutorial:

- Understand how DGL enables computation on graphs at a high level.
- Train a simple graph neural network in DGL to classify nodes in a graph.

At the end of this tutorial, we hope you get a brief feeling of how DGL works.

*This tutorial assumes basic familiarity with PyTorch.*
"""
###############################################################################
# Step 0: Problem description
# ---------------------------
#
# We start with the well-known "Zachary's karate club" problem. The karate club
# is a social network that captures 34 members and documents pairwise links
# between members who interact outside the club. The club later divides into
# two communities led by the instructor (node 0) and the club president (node
# 33). The network is visualized as follows, with the color indicating the
# community:
#
# .. image:: https://s3.us-east-2.amazonaws.com/dgl.ai/tutorial/img/karate-club.png
#    :align: center
#
# The task is to predict which side (0 or 33) each member tends to join given
# the social network itself.
###############################################################################
# Step 1: Creating a graph in DGL
# -------------------------------
# Creating the graph for Zachary's karate club goes as follows:

import dgl
@@ -59,7 +50,7 @@ def build_karate_club_graph():
    g = dgl.DGLGraph()
    # add 34 nodes into the graph; nodes are labeled from 0~33
    g.add_nodes(34)
    # all 78 edges as a list of tuples
    edge_list = [(1, 0), (2, 0), (2, 1), (3, 0), (3, 1), (3, 2),
                 (4, 0), (5, 0), (6, 0), (6, 4), (6, 5), (7, 0), (7, 1),
                 (7, 2), (7, 3), (8, 0), (8, 2), (9, 2), (10, 0), (10, 4),
@@ -73,37 +64,44 @@ def build_karate_club_graph():
                 (33, 14), (33, 15), (33, 18), (33, 19), (33, 20), (33, 22),
                 (33, 23), (33, 26), (33, 27), (33, 28), (33, 29), (33, 30),
                 (33, 31), (33, 32)]
    # add edges via two lists of nodes: src and dst
    src, dst = tuple(zip(*edge_list))
    g.add_edges(src, dst)
    # edges are directional in DGL; make them bi-directional
    g.add_edges(dst, src)
    return g
###############################################################################
# We can print out the number of nodes and edges in our newly constructed graph:

G = build_karate_club_graph()
print('We have %d nodes.' % G.number_of_nodes())
print('We have %d edges.' % G.number_of_edges())
###############################################################################
# We can also visualize the graph by converting it to a `networkx
# <https://networkx.github.io/documentation/stable/>`_ graph:

import networkx as nx
# Since the actual graph is undirected, we convert it for visualization
# purposes.
nx_G = G.to_networkx().to_undirected()
# The Kamada-Kawai layout usually looks pretty for arbitrary graphs
pos = nx.kamada_kawai_layout(nx_G)
nx.draw(nx_G, pos, with_labels=True, node_color=[[.7, .7, .7]])
###############################################################################
# Step 2: assign features to nodes or edges
# -----------------------------------------
# Graph neural networks associate features with nodes and edges for training.
# For our classification example, we assign each node an input feature that is
# a one-hot vector: node :math:`v_i`'s feature vector is
# :math:`[0,\ldots,1,\dots,0]`, where the :math:`i^{th}` position is one.
#
# In DGL, we can add features for all nodes at once, using a feature tensor that
# batches node features along the first dimension. The code below adds the one-hot
# feature for all nodes:
import torch
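# One way to attach these features (a minimal sketch; the exact assignment in
# the elided lines may differ slightly): a 34 x 34 identity matrix gives every
# node its one-hot row.
G.ndata['feat'] = torch.eye(34)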
@@ -120,46 +118,42 @@ print(G.nodes[2].data['feat'])
print(G.nodes[[10, 11]].data['feat'])
###############################################################################
# Step 3: define a Graph Convolutional Network (GCN)
# ---------------------------------------------------
# To perform node classification, we use the Graph Convolutional Network
# (GCN) developed by `Kipf and Welling <https://arxiv.org/abs/1609.02907>`_. Here
# we provide the simplest definition of a GCN framework, but we recommend that the
# reader consult the original paper for more details.
#
# - At layer :math:`l`, each node :math:`v_i^l` carries a feature vector :math:`h_i^l`.
# - Each layer of the GCN aggregates the features :math:`h_u^{l}` of the
#   neighborhood nodes :math:`u` of :math:`v_i` into the next-layer representation
#   :math:`h_i^{l+1}`. This is followed by an affine transformation with some
#   non-linearity.
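#
# Ignoring the normalization constant :math:`c_{ij}` for the moment (as the
# code below also does), the two points above can be summarized in one sketched
# layer update:
#
# .. math::
#
#    h_i^{l+1} = \sigma\left(W^{l} \sum_{j \in \mathcal{N}(i)} h_j^{l}\right)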
#
# The above definition of GCN fits into a **message-passing** paradigm: each
# node updates its own feature with information sent from neighboring
# nodes. A graphical demonstration is displayed below.
#
# .. image:: https://s3.us-east-2.amazonaws.com/dgl.ai/tutorial/1_first/mailbox.png
#    :alt: mailbox
#    :align: center
#
# Now, we show that the GCN layer can be easily implemented in DGL.
import torch.nn as nn
import torch.nn.functional as F

# Define the message & reduce function
# NOTE: we ignore the GCN's normalization constant c_ij for this tutorial.
def gcn_message(edges):
    # The argument is a batch of edges.
    # This computes a (batch of) message called 'msg' using the source node's feature 'h'.
    return {'msg' : edges.src['h']}

def gcn_reduce(nodes):
    # The argument is a batch of nodes.
    # This computes the new 'h' features by summing received 'msg' in each node's mailbox.
    return {'h' : torch.sum(nodes.mailbox['msg'], dim=1)}

# Define the GCNLayer module
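# A minimal sketch of the layer's constructor (assumed here for readability;
# the elided lines define it in the same spirit): the layer only needs one
# linear transformation, applied after neighborhood aggregation.
class GCNLayer(nn.Module):
    def __init__(self, in_feats, out_feats):
        super(GCNLayer, self).__init__()
        self.linear = nn.Linear(in_feats, out_feats)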
@@ -172,8 +166,9 @@ class GCNLayer(nn.Module):
        # g is the graph and inputs is the input node features
        # first set the node features
        g.ndata['h'] = inputs
        # trigger message passing on all edges
        g.send(g.edges(), gcn_message)
        # trigger aggregation at all nodes
        g.recv(g.nodes(), gcn_reduce)
        # get the result node features
        h = g.ndata.pop('h')
@@ -181,12 +176,15 @@ class GCNLayer(nn.Module):
        return self.linear(h)
###############################################################################
# In general, the nodes send information computed via the *message functions*,
# and aggregate incoming information with the *reduce functions*.
#
# We then define a deeper GCN model that contains two GCN layers:
# Define a 2-layer GCN model
class GCN(nn.Module):
    def __init__(self, in_feats, hidden_size, num_classes):
        super(GCN, self).__init__()
        self.gcn1 = GCNLayer(in_feats, hidden_size)
        self.gcn2 = GCNLayer(hidden_size, num_classes)
@@ -195,26 +193,29 @@ class Net(nn.Module):
        h = torch.relu(h)
        h = self.gcn2(g, h)
        return h

# The first layer transforms input features of size 34 to a hidden size of 5.
# The second layer transforms the hidden layer and produces output features of
# size 2, corresponding to the two groups of the karate club.
net = GCN(34, 5, 2)
###############################################################################
# Step 4: data preparation and initialization
# -------------------------------------------
#
# We use one-hot vectors to initialize the node features. Since this is a
# semi-supervised setting, only the instructor (node 0) and the club president
# (node 33) are assigned labels. The implementation is as follows.

inputs = torch.eye(34)
labeled_nodes = torch.tensor([0, 33])  # only the instructor and the president nodes are labeled
labels = torch.tensor([0, 1])  # their labels are different
###############################################################################
# Step 5: train then visualize
# ----------------------------
# The training loop is the same as for any other PyTorch model.
# We (1) create an optimizer, (2) feed the inputs to the model,
# (3) calculate the loss and (4) use autograd to optimize the model.
optimizer = torch.optim.Adam(net.parameters(), lr=0.01)
all_logits = []
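
# A minimal sketch of the loop body described above (assumed; the elided lines
# of the actual loop differ only in detail):
for epoch in range(30):
    logits = net(G, inputs)
    all_logits.append(logits.detach())              # keep logits for the animation below
    logp = F.log_softmax(logits, 1)
    loss = F.nll_loss(logp[labeled_nodes], labels)  # loss on the two labeled nodes only

    optimizer.zero_grad()
    loss.backward()
    optimizer.step()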
@@ -233,8 +234,13 @@ for epoch in range(30):
    print('Epoch %d | Loss: %.4f' % (epoch, loss.item()))
###############################################################################
# This is a rather toy example, so it does not even have a validation or test
# set. Since the model produces an output feature of size 2 for each node, we
# can visualize it by plotting the output features in a 2D space.
# The following code animates the training process from the initial guess
# (where the nodes are not classified correctly at all) to the end
# (where the nodes are linearly separable).
import matplotlib.animation as animation
import matplotlib.pyplot as plt
@@ -253,10 +259,6 @@ def draw(i):
    nx.draw_networkx(nx_G.to_undirected(), pos, node_color=colors,
                     with_labels=True, node_size=300, ax=ax)

fig = plt.figure(dpi=150)
fig.clf()
ax = fig.subplots()
@@ -271,7 +273,7 @@ plt.close()

###############################################################################
# The following animation shows how the model correctly predicts the community
# after a series of training epochs.

ani = animation.FuncAnimation(fig, draw, frames=len(all_logits), interval=200)

@@ -284,5 +286,6 @@ ani = animation.FuncAnimation(fig, draw, frames=len(all_logits), interval=200)
###############################################################################
# Next steps
# ----------
#
# In the :doc:`next tutorial <2_basics>`, we will go through some more basics
# of DGL, such as reading and writing node/edge features.
Basic Tutorials
===============

These tutorials cover the basics of DGL.
Graph-based Neural Network Models
=================================

We developed DGL with a broad range of applications in mind. Building
state-of-the-art models forces us to think hard about the most common and
useful APIs, learn the hard lessons, and push the system design.

We have prototyped altogether 10 different models. All of them are ready to run
out of the box, and some of them are very new graph-based algorithms. In most
cases, they demonstrate the performance, flexibility, and expressiveness of
DGL. Where we still fall short, these exercises point to future directions.

We categorize the models below, providing links to the original code and a
tutorial when appropriate. As will become apparent, these models stress the use
of different DGL APIs.
Graph neural network and its variants
--------------------------------------

* **GCN** `[paper] <https://arxiv.org/abs/1609.02907>`__ `[tutorial] <models/1_gcn.html>`__
  `[code] <https://github.com/jermainewang/dgl/blob/master/examples/pytorch/gcn/gcn.py>`__:
  this is the vanilla GCN. The tutorial covers the basic uses of DGL APIs.
* **GAT** `[paper] <https://arxiv.org/abs/1710.10903>`__
  `[code] <https://github.com/jermainewang/dgl/blob/master/examples/pytorch/gat/gat.py>`__:
  the key extension of GAT w.r.t. the vanilla GCN is deploying multi-head attention
  over the neighborhood of a node, which greatly enhances the capacity and
  expressiveness of the model.
* **R-GCN** `[paper] <https://arxiv.org/abs/1703.06103>`__ `[tutorial] <models/4_rgcn.html>`__
  [code (wip)]: the key
  difference of R-GCN is to allow multiple edges between two entities of a
  graph, where edges with distinct relationships are encoded differently. This
  is an interesting extension of GCN that can have a lot of applications of its own.
* **LGNN** `[paper] <https://arxiv.org/abs/1705.08415>`__ `[tutorial (wip)]` `[code (wip)]`:
  this model focuses on community detection by inspecting graph structures. It
  uses representations of both the original graph and its line-graph companion. In
  addition to demonstrating how an algorithm can harness multiple graphs, our
  implementation shows how one can judiciously mix vanilla tensor operations,
  sparse-matrix tensor operations, and message passing with DGL.
* **SSE** `[paper] <http://proceedings.mlr.press/v80/dai18a/dai18a.pdf>`__ `[tutorial (wip)]`
  `[code] <https://github.com/jermainewang/dgl/blob/master/examples/mxnet/sse/sse_batch.py>`__:
  the emphasis here is on *giant* graphs that cannot fit comfortably on one GPU
  card. SSE is an example that illustrates the co-design of algorithm and
  system: sampling to guarantee asymptotic convergence while lowering the
  complexity, and batching across samples for maximum parallelism.
Dealing with many small graphs
------------------------------

* **Tree-LSTM** `[paper] <https://arxiv.org/abs/1503.00075>`__ `[tutorial] <models/3_tree-lstm.html>`__
  `[code] <https://github.com/jermainewang/dgl/blob/master/examples/pytorch/tree_lstm/tree_lstm.py>`__:
  sentences of natural languages have inherent structures, which are thrown away
  by treating them simply as sequences. Tree-LSTM is a powerful model that learns
  the representation by leveraging prior syntactic structures (e.g. parse trees).
  The challenge in training it well is that simply padding a sentence to the
  maximum length no longer works, since trees of different sentences have
  different sizes and topologies. DGL solves this problem by throwing the trees
  into one bigger "container" graph, and using message passing to exploit maximum
  parallelism. The key API we use is batching, sketched below.
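
  A minimal sketch of that batching step (assuming the ``dgl.batch`` API;
  ``tree1``, ``tree2``, ``tree3`` are placeholder per-sentence graphs):

  .. code:: python

     import dgl

     # trees of different sizes/topologies, each already built as a DGLGraph
     forest = dgl.batch([tree1, tree2, tree3])
     # message passing on ``forest`` now runs over all the trees in parallel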

Generative models
-----------------

* **DGMG** `[paper] <https://arxiv.org/abs/1803.03324>`__ `[tutorial] <models/5_dgmg.html>`__
  `[code] <https://github.com/jermainewang/dgl/tree/master/examples/pytorch/dgmg>`__:
  this model belongs to the important family that deals with structural
  generation. DGMG is interesting because its state-machine approach is the most
  general. It is also very challenging because, unlike Tree-LSTM, every sample
  has a dynamic, probability-driven structure that is not available before
  training. We are able to progressively leverage intra- and inter-graph
  parallelism to steadily improve the performance.
* **JTNN** `[paper] <https://arxiv.org/abs/1802.04364>`__ `[code (wip)]`: unlike DGMG, this
  paper generates molecular graphs using the framework of a variational
  auto-encoder. Perhaps more interesting is its approach to building structures
  hierarchically, in the case of molecules, with a junction tree as the
  intermediate scaffolding.
Old (new) wines in new bottles
------------------------------

* **Capsule** `[paper] <https://arxiv.org/abs/1710.09829>`__ `[tutorial] <models/2_capsule.html>`__
  `[code] <https://github.com/jermainewang/dgl/tree/master/examples/pytorch/capsule>`__: this new
  computer vision model has two key ideas -- enhancing the feature representation
  in a vector form (instead of a scalar) called a *capsule*, and replacing
  max-pooling with dynamic routing. The idea of dynamic routing is to integrate a
  lower-level capsule into one (or several) higher-level capsules with
  non-parametric message passing. We show how the latter can be nicely implemented
  with DGL APIs.
* **Transformer** `[paper] <https://arxiv.org/abs/1706.03762>`__ `[tutorial (wip)]` `[code (wip)]` and
  **Universal Transformer** `[paper] <https://arxiv.org/abs/1807.03819>`__ `[tutorial (wip)]`
  `[code (wip)]`: these
  two models replace RNNs with several layers of multi-head attention to encode
  and discover structures among the tokens of a sentence. These attention mechanisms
  can similarly be formulated as graph operations with message passing.