"""
Introduction to Neighbor Sampling for GNN Training
==================================================

In :doc:`previous tutorials <1_introduction>` you have learned how to
train GNNs by computing the representations of all nodes in a graph.
However, sometimes your graph is too large for the computation over all
nodes to fit on a single GPU.

By the end of this tutorial, you will be able to

-  Understand the pipeline of stochastic GNN training.
-  Understand what neighbor sampling is and why it yields a bipartite
   graph for each GNN layer.
"""


######################################################################
# Message Passing Review
# ----------------------
#
# Recall that in `Gilmer et al. <https://arxiv.org/abs/1704.01212>`__
# (also in :doc:`message passing tutorial <3_message_passing>`), the
# message passing formulation is as follows:
#
# .. math::
#
#
#    m_{u\to v}^{(l)} = M^{(l)}\left(h_v^{(l-1)}, h_u^{(l-1)}, e_{u\to v}^{(l-1)}\right)
#
# .. math::
#
#
#    m_{v}^{(l)} = \sum_{u\in\mathcal{N}(v)}m_{u\to v}^{(l)}
#
# .. math::
#
#
#    h_v^{(l)} = U^{(l)}\left(h_v^{(l-1)}, m_v^{(l)}\right)
#
# where DGL calls :math:`M^{(l)}` the *message function*, :math:`\sum` the
# *reduce function* and :math:`U^{(l)}` the *update function*. Note that
# :math:`\sum` here can represent any function and is not necessarily a
# summation.
#
# Essentially, the :math:`l`-th layer representation of a single node
# depends on the :math:`(l-1)`-th layer representation of the same node,
# as well as the :math:`(l-1)`-th layer representations of its neighboring
# nodes. Those :math:`(l-1)`-th layer representations in turn depend on
# the :math:`(l-2)`-th layer representations of those nodes and of their
# neighbors.
#
# The following animation shows how a 2-layer GNN is supposed to compute
# the output of node 5:
#
# |image1|
#
# You can see that to compute the second-layer representation of node 5,
# you will need the first-layer representations of its direct neighbors
# (colored in yellow), which in turn need the representations of their own
# direct neighbors (i.e. node 5’s second-hop neighbors, colored in green).
#
# .. |image1| image:: https://data.dgl.ai/tutorial/img/sampling.gif
#
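
######################################################################
# Below is a minimal sketch of one such message passing layer in DGL,
# assuming a ``DGLGraph`` ``g`` with node features ``h``. The built-ins
# ``fn.copy_u`` and ``fn.sum`` play the roles of the message function
# :math:`M^{(l)}` and the reduce function, while the two linear layers act
# as the update function :math:`U^{(l)}`. The class name
# ``ToyMessagePassingLayer`` is just for illustration.

import dgl.function as fn
import torch.nn as nn


class ToyMessagePassingLayer(nn.Module):
    def __init__(self, in_feats, out_feats):
        super().__init__()
        self.W_self = nn.Linear(in_feats, out_feats)
        self.W_neigh = nn.Linear(in_feats, out_feats)

    def forward(self, g, h):
        with g.local_scope():
            g.ndata["h"] = h
            # Message: copy each neighbor's feature; reduce: sum the messages.
            g.update_all(fn.copy_u("h", "m"), fn.sum("m", "h_agg"))
            # Update: combine a node's own feature with its aggregated messages.
            return self.W_self(h) + self.W_neigh(g.ndata["h_agg"])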


######################################################################
# Neighbor Sampling Overview
# --------------------------
#
# You can also see from the previous example that computing the
# representations of a small number of nodes often requires the input
# features of a significantly larger number of nodes. Taking all
# neighbors for message aggregation is often too costly, since the set
# of nodes needed for the input features can easily cover a large
# portion of the graph, especially for real-world graphs, which are
# often `scale-free <https://en.wikipedia.org/wiki/Scale-free_network>`__.
#
# Neighbor sampling addresses this issue by selecting a subset of the
# neighbors to perform aggregation. For instance, to compute
# :math:`h_8^{(2)}`, you can choose to aggregate over two of the
# neighbors instead of all of them, as in the following animation:
#
# |image2|
#
# You can see that this method requires far fewer nodes for message
# passing in a single minibatch.
#
# .. |image2| image:: https://data.dgl.ai/tutorial/img/bipartite.gif
#
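
######################################################################
# Here is a small runnable illustration of the idea:
# :func:`dgl.sampling.sample_neighbors` picks at most ``fanout``
# in-neighbors of each seed node uniformly at random. The toy graph below
# is made up for this example.

import dgl
import torch

# A toy graph where nodes 0-4 all point to node 5.
g = dgl.graph((torch.tensor([0, 1, 2, 3, 4]), torch.tensor([5, 5, 5, 5, 5])))
# Sample at most two in-edges of seed node 5.
sg = dgl.sampling.sample_neighbors(g, torch.tensor([5]), fanout=2)
print(sg.edges())  # two randomly chosen edges pointing to node 5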


######################################################################
# You can also notice that the computation dependencies in the animation
# above can be described as a series of *bipartite graphs*.
# The output nodes are on one side and all the nodes necessary as inputs
# are on the other side. The arrows indicate how the sampled neighbors
# propagate messages to the output nodes.
#
# Note that some GNN modules, such as ``SAGEConv``, need to use the output
# nodes' features from the previous layer to compute the outputs. Without
# loss of generality, DGL always includes the output nodes themselves
# among the input nodes.
#
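
######################################################################
# A sketch of how such a bipartite dependency looks in DGL, continuing
# from the sampled subgraph above: :func:`dgl.to_block` converts it into
# a *block* whose destination (output) side contains node 5, and whose
# source (input) side contains node 5 together with its sampled neighbors.

block = dgl.to_block(sg, dst_nodes=torch.tensor([5]))
print(block.srcdata[dgl.NID])  # input nodes: node 5 plus its sampled neighbors
print(block.dstdata[dgl.NID])  # output nodes: node 5 only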


######################################################################
# What’s next?
# ------------
#
# :doc:`Stochastic GNN Training for Node Classification in
# DGL <L1_large_node_classification>`
#