Unverified Commit e0189397 authored by Quan (Andy) Gan, committed by GitHub

[Tutorial] New tutorials for small graphs (#2482)

* new tutorials for small graphs

* address changes and use GraphDataLoader

* fix and add data

* fix load_data

* style fixes

* Update 5_graph_classification.py
parent 16169f3a
@@ -35,6 +35,8 @@ when the graph is heterogeneous.
     DGLGraph.metagraph
     DGLGraph.to_canonical_etype
+.. _apigraph-querying-graph-structure:
+
 Querying graph structure
 ------------------------
@@ -196,8 +196,9 @@ intersphinx_mapping = {
 from sphinx_gallery.sorting import FileNameSortKey
 examples_dirs = ['../../tutorials/basics',
-                 '../../tutorials/models']   # path to find sources
-gallery_dirs = ['tutorials/basics', 'tutorials/models']  # path to generate docs
+                 '../../tutorials/models',
+                 '../../new-tutorial']   # path to find sources
+gallery_dirs = ['tutorials/basics', 'tutorials/models', 'new-tutorial']  # path to generate docs
 reference_url = {
     'dgl' : None,
     'numpy': 'http://docs.scipy.org/doc/numpy/',
@@ -42,7 +42,7 @@ Getting Started
 ..
    Follow the :doc:`instructions<install/index>` to install DGL.
-:doc:`DGL at a glance<tutorials/basics/1_first>` is the most common place to get started with.
+:doc:`<new-tutorial/1_introduction>` is the most common place to get started with.
 It offers a broad experience of using DGL for deep learning on graph data.
 API reference document lists more endetailed specifications of each API and GNN modules,
@@ -50,9 +50,11 @@ Getting Started
 You can learn other basic concepts of DGL through the dedicated tutorials.
-* Learn constructing graphs and set/get node and edge features :doc:`here<tutorials/basics/2_basics>`.
-* Learn performing computation on graph using message passing :doc:`here<tutorials/basics/3_pagerank>`.
-* Learn processing multiple graph samples in a batch :doc:`here<tutorials/basics/4_batch>`.
+* Learn constructing, saving and loading graphs with node and edge features :doc:`here<new-tutorial/2_dglgraph>`.
+* Learn performing computation on graph using message passing :doc:`here<new-tutorial/3_message_passing>`.
+* Learn link prediction with DGL :doc:`here<new-tutorial/4_link_predict>`.
+* Learn graph classification with DGL :doc:`here<new-tutorial/5_graph_classification>`.
+* Learn creating your own dataset for DGL :doc:`here<new-tutorial/6_load_data>`.
 * Learn working with heterogeneous graph data :doc:`here<tutorials/basics/5_hetero>`.
@@ -79,7 +81,7 @@ Getting Started
    install/index
    install/backend
-   tutorials/basics/1_first
+   new-tutorial/1_introduction
 .. toctree::
    :maxdepth: 2
"""
A Blitz Introduction to DGL - Node Classification
=================================================
GNNs are powerful tools for many machine learning tasks on graphs. In
this introductory tutorial, you will learn the basic workflow of using
GNNs for node classification, i.e. predicting the category of a node in
a graph.
By completing this tutorial, you will be able to
- Load a DGL-provided dataset.
- Build a GNN model with DGL-provided neural network modules.
- Train and evaluate a GNN model for node classification on either CPU
or GPU.
This tutorial assumes that you have experience in building neural
networks with PyTorch.
(Time estimate: 13 minutes)
"""
import dgl
import torch
import torch.nn as nn
import torch.nn.functional as F
######################################################################
# Overview of Node Classification with GNN
# ----------------------------------------
#
# One of the most popular and widely adopted tasks on graph data is node
# classification, where a model needs to predict the ground truth category
# of each node. Before graph neural networks, many proposed methods used
# either connectivity alone (such as DeepWalk or node2vec) or simple
# combinations of connectivity and the node's own features. GNNs, by
# contrast, offer an opportunity to obtain node representations by
# combining the connectivity and features of a *local neighborhood*.
#
# `Kipf et al. <https://arxiv.org/abs/1609.02907>`__ is an example that
# formulates node classification as a semi-supervised task: with the help
# of only a small portion of labeled nodes, a graph neural network (GNN)
# can accurately predict the node category of the others.
#
# This tutorial will show how to build such a GNN for semi-supervised node
# classification with only a small number of labels on the Cora
# dataset,
# a citation network with papers as nodes and citations as edges. The task
# is to predict the category of a given paper. Each paper node contains a
# word count vector as its features, normalized so that they sum up to one,
# as described in Section 5.2 of
# `the paper <https://arxiv.org/abs/1609.02907>`__.
#
# Loading Cora Dataset
# --------------------
#
import dgl.data
dataset = dgl.data.CoraGraphDataset()
print('Number of categories:', dataset.num_classes)
######################################################################
# A DGL Dataset object may contain one or multiple graphs. The Cora
# dataset used in this tutorial consists of only a single graph.
#
g = dataset[0]
######################################################################
# A DGL graph can store node features and edge features in two
# dictionary-like attributes called ``ndata`` and ``edata``.
# In the DGL Cora dataset, the graph contains the following node features:
#
# - ``train_mask``: A boolean tensor indicating whether the node is in the
# training set.
#
# - ``val_mask``: A boolean tensor indicating whether the node is in the
# validation set.
#
# - ``test_mask``: A boolean tensor indicating whether the node is in the
# test set.
#
# - ``label``: The ground truth node category.
#
# - ``feat``: The node features.
#
print('Node features')
print(g.ndata)
print('Edge features')
print(g.edata)
######################################################################
# Defining a Graph Convolutional Network (GCN)
# --------------------------------------------
#
# This tutorial will build a two-layer `Graph Convolutional Network
# (GCN) <http://tkipf.github.io/graph-convolutional-networks/>`__. Each
# layer computes new node representations by aggregating neighbor
# information.
#
# To build a multi-layer GCN you can simply stack ``dgl.nn.GraphConv``
# modules, which inherit from ``torch.nn.Module``.
#
from dgl.nn import GraphConv
class GCN(nn.Module):
def __init__(self, in_feats, h_feats, num_classes):
super(GCN, self).__init__()
self.conv1 = GraphConv(in_feats, h_feats)
self.conv2 = GraphConv(h_feats, num_classes)
def forward(self, g, in_feat):
h = self.conv1(g, in_feat)
h = F.relu(h)
h = self.conv2(g, h)
return h
# Create the model with given dimensions
model = GCN(g.ndata['feat'].shape[1], 16, dataset.num_classes)
######################################################################
# DGL provides implementations of many popular neighbor aggregation
# modules. You can easily invoke them with one line of code.
#
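######################################################################
# For instance, the following lines (a quick sketch, not used in the
# rest of this tutorial; the feature sizes are arbitrary) construct a
# single GraphSAGE layer and a single GAT layer with ``dgl.nn``:
#

from dgl.nn import SAGEConv, GATConv

sage_layer = SAGEConv(in_feats=10, out_feats=16, aggregator_type='mean')
gat_layer = GATConv(in_feats=10, out_feats=16, num_heads=4)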
######################################################################
# Training the GCN
# ----------------
#
# Training this GCN is similar to training other PyTorch neural networks.
#
def train(g, model):
optimizer = torch.optim.Adam(model.parameters(), lr=0.01)
best_val_acc = 0
best_test_acc = 0
features = g.ndata['feat']
labels = g.ndata['label']
train_mask = g.ndata['train_mask']
val_mask = g.ndata['val_mask']
test_mask = g.ndata['test_mask']
for e in range(100):
# Forward
logits = model(g, features)
# Compute prediction
pred = logits.argmax(1)
# Compute loss
# Note that you should only compute the losses of the nodes in the training set.
loss = F.cross_entropy(logits[train_mask], labels[train_mask])
# Compute accuracy on training/validation/test
train_acc = (pred[train_mask] == labels[train_mask]).float().mean()
val_acc = (pred[val_mask] == labels[val_mask]).float().mean()
test_acc = (pred[test_mask] == labels[test_mask]).float().mean()
# Save the best validation accuracy and the corresponding test accuracy.
if best_val_acc < val_acc:
best_val_acc = val_acc
best_test_acc = test_acc
# Backward
optimizer.zero_grad()
loss.backward()
optimizer.step()
if e % 5 == 0:
print('In epoch {}, loss: {:.3f}, val acc: {:.3f} (best {:.3f}), test acc: {:.3f} (best {:.3f})'.format(
e, loss, val_acc, best_val_acc, test_acc, best_test_acc))
model = GCN(g.ndata['feat'].shape[1], 16, dataset.num_classes)
train(g, model)
######################################################################
# Training on GPU
# ---------------
#
# Training on GPU requires putting both the model and the graph onto the
# GPU with the ``to`` method, similar to what you would do in PyTorch.
#
# The following requires a CUDA-enabled machine and a CUDA build of DGL.
g = g.to('cuda')
model = GCN(g.ndata['feat'].shape[1], 16, dataset.num_classes).to('cuda')
train(g, model)
######################################################################
# What’s next?
# ------------
#
# - :doc:`How does DGL represent a graph <2_dglgraph>`?
# - :doc:`Write your own GNN module <3_message_passing>`.
# - :doc:`Link prediction (predicting existence of edges) on full
# graph <4_link_predict>`.
# - :doc:`Graph classification <5_graph_classification>`.
# - :doc:`Make your own dataset <6_load_data>`.
# - :ref:`The list of supported graph convolution
# modules <apinn-pytorch>`.
# - :ref:`The list of datasets provided by DGL <apidata>`.
#
"""
How Does DGL Represent A Graph?
===============================
By the end of this tutorial you will be able to:
- Construct a graph in DGL from scratch.
- Assign node and edge features to a graph.
- Query properties of a DGL graph such as node degrees and
connectivity.
- Transform a DGL graph into another graph.
- Load and save DGL graphs.
(Time estimate: 16 minutes)
"""
######################################################################
# DGL Graph Construction
# ----------------------
#
# DGL represents a directed graph as a ``DGLGraph`` object. You can
# construct a graph by specifying the number of nodes in the graph as well
# as the list of source and destination nodes. Nodes in the graph have
# consecutive IDs starting from 0.
#
# For instance, the following code constructs a directed star graph with 5
# leaves. The center node's ID is 0. The edges go from the
# center node to the leaves.
#
import dgl
import numpy as np
import torch
g = dgl.graph(([0, 0, 0, 0, 0], [1, 2, 3, 4, 5]), num_nodes=6)
# Equivalently, PyTorch LongTensors also work.
g = dgl.graph((torch.LongTensor([0, 0, 0, 0, 0]), torch.LongTensor([1, 2, 3, 4, 5])), num_nodes=6)
# You can omit the number of nodes argument if you can tell the number of nodes from the edge list alone.
g = dgl.graph(([0, 0, 0, 0, 0], [1, 2, 3, 4, 5]))
######################################################################
# Edges in the graph have consecutive IDs starting from 0, and are
# in the same order as the list of source and destination nodes during
# creation.
#
# Print the source and destination nodes of every edge.
print(g.edges())
######################################################################
# .. note::
#
# ``DGLGraph`` objects are always directed to best fit the computation
# pattern of graph neural networks, where the messages sent
# from one node to another are often different between the two
# directions. If you want to handle undirected graphs, you may consider
# treating them as bidirectional graphs. See `Graph
# Transformations`_ for an example of making
# a bidirectional graph.
#
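######################################################################
# As a quick sketch, ``dgl.to_bidirected`` builds such a bidirectional
# graph from the star graph above by adding an edge in the opposite
# direction for every existing edge (``dgl.add_reverse_edges`` is
# demonstrated later in this tutorial):
#

bg = dgl.to_bidirected(g)
# The bidirected graph has 10 edges: 5 out of the center and 5 into it.
print(bg.edges())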
######################################################################
# Assigning Node and Edge Features to Graph
# -----------------------------------------
#
# Many graph datasets contain attributes on nodes and edges.
# Although the types of node and edge attributes can be arbitrary in the
# real world, ``DGLGraph`` only accepts attributes stored in tensors (with
# numerical contents). Consequently, a given attribute must have the same
# shape across all nodes (or across all edges). In the context of deep
# learning, those attributes are often called *features*.
#
# You can assign and retrieve node and edge features via ``ndata`` and
# ``edata`` interface.
#
# Assign a 3-dimensional node feature vector for each node.
g.ndata['x'] = torch.randn(6, 3)
# Assign a 4-dimensional edge feature vector for each edge.
g.edata['a'] = torch.randn(5, 4)
# Assign a 5x4 node feature matrix for each node. Node and edge features in DGL can be multi-dimensional.
g.ndata['y'] = torch.randn(6, 5, 4)
print(g.edata['a'])
######################################################################
# .. note::
#
# The rapid development of deep learning has provided us with many
# ways to encode various types of attributes into numerical features.
# Here are some general suggestions:
#
# - For categorical attributes (e.g. gender, occupation), consider
# converting them to integers or one-hot encoding.
# - For variable length string contents (e.g. news article, quote),
# consider applying a language model.
# - For images, consider applying a vision model such as CNNs.
#
# You can find plenty of materials on how to encode such attributes
# into a tensor in the `PyTorch Deep Learning
# Tutorials <https://pytorch.org/tutorials/>`__.
#
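######################################################################
# As a minimal sketch of the first suggestion (the categorical attribute
# below is made up purely for illustration and is not part of this
# tutorial's graph), integer category codes can be one-hot encoded and
# stored as a node feature:
#

# A hypothetical categorical attribute with 3 categories, one per node.
category_codes = torch.tensor([0, 2, 1, 0, 2, 1])
g.ndata['category'] = torch.nn.functional.one_hot(category_codes, num_classes=3).float()
print(g.ndata['category'])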
######################################################################
# Querying Graph Structures
# -------------------------
#
# A ``DGLGraph`` object provides various methods to query the graph structure.
#
print(g.num_nodes())
print(g.num_edges())
# Out degrees of the center node
print(g.out_degrees(0))
# In degrees of the center node - note that the graph is directed so the in degree should be 0.
print(g.in_degrees(0))
######################################################################
# Graph Transformations
# ---------------------
#
######################################################################
# DGL provides many APIs to transform a graph into another, such as
# extracting a subgraph:
#
# Induce a subgraph from node 0, node 1 and node 3 from the original graph.
sg1 = g.subgraph([0, 1, 3])
# Induce a subgraph from edge 0, edge 1 and edge 3 from the original graph.
sg2 = g.edge_subgraph([0, 1, 3])
######################################################################
# You can obtain the node/edge mapping from the subgraph to the original
# graph by looking into the node feature ``dgl.NID`` or edge feature
# ``dgl.EID`` in the new graph.
#
# The original IDs of each node in sg1
print(sg1.ndata[dgl.NID])
# The original IDs of each edge in sg1
print(sg1.edata[dgl.EID])
# The original IDs of each node in sg2
print(sg2.ndata[dgl.NID])
# The original IDs of each edge in sg2
print(sg2.edata[dgl.EID])
######################################################################
# ``subgraph`` and ``edge_subgraph`` also copy the original features
# to the subgraph:
#
# The original node feature of each node in sg1
print(sg1.ndata['x'])
# The original edge feature of each edge in sg1
print(sg1.edata['a'])
# The original node feature of each node in sg2
print(sg2.ndata['x'])
# The original edge feature of each edge in sg2
print(sg2.edata['a'])
######################################################################
# Another common transformation is to add a reverse edge for each edge in
# the original graph with ``dgl.add_reverse_edges``.
#
# .. note::
#
# If you have an undirected graph, it is better to convert it
# into a bidirectional graph first by adding reverse edges.
#
newg = dgl.add_reverse_edges(g)
print(newg.edges())
######################################################################
# Loading and Saving Graphs
# -------------------------
#
# You can save a graph or a list of graphs via ``dgl.save_graphs`` and
# load them back with ``dgl.load_graphs``.
#
# Save graphs
dgl.save_graphs('graph.dgl', g)
dgl.save_graphs('graphs.dgl', [g, sg1, sg2])
# Load graphs
(g,), _ = dgl.load_graphs('graph.dgl')
print(g)
(g, sg1, sg2), _ = dgl.load_graphs('graphs.dgl')
print(g)
print(sg1)
print(sg2)
######################################################################
# What’s next?
# ------------
#
# - See
# :ref:`here <apigraph-querying-graph-structure>`
# for a list of graph structure query APIs.
# - See
# :ref:`here <api-subgraph-extraction>`
# for a list of subgraph extraction routines.
# - See
# :ref:`here <api-transform>`
# for a list of graph transformation routines.
# - API reference of :func:`dgl.save_graphs`
# and
# :func:`dgl.load_graphs`
#
"""
Write your own GNN module
=========================
Sometimes, your model goes beyond simply stacking existing GNN modules.
For example, you would like to invent a new way of aggregating neighbor
information by considering node importance or edge weights.
By the end of this tutorial you will be able to
- Understand DGL’s message passing APIs.
- Implement your own GraphSAGE convolution module.
This tutorial assumes that you already know :doc:`the basics of training a
GNN for node classification <1_introduction>`.
(Time estimate: 10 minutes)
"""
import dgl
import torch
import torch.nn as nn
import torch.nn.functional as F
######################################################################
# Message passing and GNNs
# ------------------------
#
# DGL follows the *message passing paradigm* inspired by the Message
# Passing Neural Network proposed by `Gilmer et
# al. <https://arxiv.org/abs/1704.01212>`__ Essentially, they found that many
# GNN models can fit into the following framework:
#
# .. math::
#
#
# m_{u\to v}^{(l)} = M^{(l)}\left(h_v^{(l-1)}, h_u^{(l-1)}, e_{u\to v}^{(l-1)}\right)
#
# .. math::
#
#
# m_{v}^{(l)} = \sum_{u\in\mathcal{N}(v)}m_{u\to v}^{(l)}
#
# .. math::
#
#
# h_v^{(l)} = U^{(l)}\left(h_v^{(l-1)}, m_v^{(l)}\right)
#
# where DGL calls :math:`M^{(l)}` the *message function*, :math:`\sum` the
# *reduce function* and :math:`U^{(l)}` the *update function*. Note that
# :math:`\sum` here can represent any function and is not necessarily a
# summation.
#
######################################################################
# For example, the `GraphSAGE convolution (Hamilton et al.,
# 2017) <https://cs.stanford.edu/people/jure/pubs/graphsage-nips17.pdf>`__
# takes the following mathematical form:
#
# .. math::
#
#
# h_{\mathcal{N}(v)}^k\leftarrow \text{Average}\{h_u^{k-1},\forall u\in\mathcal{N}(v)\}
#
# .. math::
#
#
# h_v^k\leftarrow \text{ReLU}\left(W^k\cdot \text{CONCAT}(h_v^{k-1}, h_{\mathcal{N}(v)}^k) \right)
#
# You can see that message passing is directional: the message sent from
# a node :math:`u` to another node :math:`v` is not necessarily the same
# as the message sent from node :math:`v` to node :math:`u` in the
# opposite direction.
#
# Although DGL has builtin support for GraphSAGE via
# :class:`dgl.nn.SAGEConv <dgl.nn.pytorch.SAGEConv>`,
# here is how you can implement GraphSAGE convolution in DGL on your own.
#
import dgl.function as fn
class SAGEConv(nn.Module):
"""Graph convolution module used by the GraphSAGE model.
Parameters
----------
in_feat : int
Input feature size.
out_feat : int
Output feature size.
"""
def __init__(self, in_feat, out_feat):
super(SAGEConv, self).__init__()
# A linear submodule for projecting the input and neighbor feature to the output.
self.linear = nn.Linear(in_feat * 2, out_feat)
def forward(self, g, h):
"""Forward computation
Parameters
----------
g : Graph
The input graph.
h : Tensor
The input node feature.
"""
with g.local_scope():
g.ndata['h'] = h
# update_all is a message passing API.
g.update_all(message_func=fn.copy_u('h', 'm'), reduce_func=fn.mean('m', 'h_N'))
h_N = g.ndata['h_N']
h_total = torch.cat([h, h_N], dim=1)
return self.linear(h_total)
######################################################################
# The central piece in this code is the
# :func:`g.update_all <dgl.DGLGraph.update_all>`
# function, which gathers and averages the neighbor features. There are
# three concepts here:
#
# * Message function ``fn.copy_u('h', 'm')`` that
# copies the node feature under name ``'h'`` as *messages* sent to
# neighbors.
#
# * Reduce function ``fn.mean('m', 'h_N')`` that averages
# all the received messages under name ``'m'`` and saves the result as a
# new node feature ``'h_N'``.
#
# * ``update_all`` tells DGL to trigger the
# message and reduce functions for all the nodes and edges.
#
######################################################################
# Afterwards, you can stack your own GraphSAGE convolution layers to form
# a multi-layer GraphSAGE network.
#
class Model(nn.Module):
def __init__(self, in_feats, h_feats, num_classes):
super(Model, self).__init__()
self.conv1 = SAGEConv(in_feats, h_feats)
self.conv2 = SAGEConv(h_feats, num_classes)
def forward(self, g, in_feat):
h = self.conv1(g, in_feat)
h = F.relu(h)
h = self.conv2(g, h)
return h
######################################################################
# Training loop
# ~~~~~~~~~~~~~
# The following code for data loading and training loop is directly copied
# from the introduction tutorial.
#
import dgl.data
dataset = dgl.data.CoraGraphDataset()
g = dataset[0]
def train(g, model):
optimizer = torch.optim.Adam(model.parameters(), lr=0.01)
all_logits = []
best_val_acc = 0
best_test_acc = 0
features = g.ndata['feat']
labels = g.ndata['label']
train_mask = g.ndata['train_mask']
val_mask = g.ndata['val_mask']
test_mask = g.ndata['test_mask']
for e in range(200):
# Forward
logits = model(g, features)
# Compute prediction
pred = logits.argmax(1)
# Compute loss
# Note that we should only compute the losses of the nodes in the training set,
# i.e. with train_mask 1.
loss = F.cross_entropy(logits[train_mask], labels[train_mask])
# Compute accuracy on training/validation/test
train_acc = (pred[train_mask] == labels[train_mask]).float().mean()
val_acc = (pred[val_mask] == labels[val_mask]).float().mean()
test_acc = (pred[test_mask] == labels[test_mask]).float().mean()
# Save the best validation accuracy and the corresponding test accuracy.
if best_val_acc < val_acc:
best_val_acc = val_acc
best_test_acc = test_acc
# Backward
optimizer.zero_grad()
loss.backward()
optimizer.step()
all_logits.append(logits.detach())
if e % 5 == 0:
print('In epoch {}, loss: {:.3f}, val acc: {:.3f} (best {:.3f}), test acc: {:.3f} (best {:.3f})'.format(
e, loss, val_acc, best_val_acc, test_acc, best_test_acc))
model = Model(g.ndata['feat'].shape[1], 16, dataset.num_classes)
train(g, model)
######################################################################
# More customization
# ------------------
#
# In DGL, we provide many built-in message and reduce functions under the
# ``dgl.function`` package. You can find more details in :ref:`the API
# doc <apifunction>`.
#
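######################################################################
# As a small illustration (a sketch, not needed for the rest of this
# tutorial), builtin message functions also work with ``apply_edges``,
# which computes edge-wise data without any reduction. The following
# computes, for every edge, the element-wise sum of its two endpoints'
# features:
#

with g.local_scope():
    g.apply_edges(fn.u_add_v('feat', 'feat', 'feat_sum'))
    print(g.edata['feat_sum'].shape)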
######################################################################
# These APIs allow one to quickly implement new graph convolution modules.
# For example, the following implements a new ``SAGEConv`` that aggregates
# neighbor representations using a weighted average. Note that the
# ``edata`` member can hold edge features, which can also take part in
# message passing.
#
class WeightedSAGEConv(nn.Module):
"""Graph convolution module used by the GraphSAGE model with edge weights.
Parameters
----------
in_feat : int
Input feature size.
out_feat : int
Output feature size.
"""
def __init__(self, in_feat, out_feat):
super(WeightedSAGEConv, self).__init__()
# A linear submodule for projecting the input and neighbor feature to the output.
self.linear = nn.Linear(in_feat * 2, out_feat)
def forward(self, g, h, w):
"""Forward computation
Parameters
----------
g : Graph
The input graph.
h : Tensor
The input node feature.
w : Tensor
The edge weight.
"""
with g.local_scope():
g.ndata['h'] = h
g.edata['w'] = w
g.update_all(message_func=fn.u_mul_e('h', 'w', 'm'), reduce_func=fn.mean('m', 'h_N'))
h_N = g.ndata['h_N']
h_total = torch.cat([h, h_N], dim=1)
return self.linear(h_total)
######################################################################
# Because the graph in this dataset does not have edge weights, we
# manually assign all edge weights to one in the ``forward()`` function of
# the model. You can replace it with your own edge weights.
#
class Model(nn.Module):
def __init__(self, in_feats, h_feats, num_classes):
super(Model, self).__init__()
self.conv1 = WeightedSAGEConv(in_feats, h_feats)
self.conv2 = WeightedSAGEConv(h_feats, num_classes)
def forward(self, g, in_feat):
h = self.conv1(g, in_feat, torch.ones(g.num_edges()).to(g.device))
h = F.relu(h)
h = self.conv2(g, h, torch.ones(g.num_edges()).to(g.device))
return h
model = Model(g.ndata['feat'].shape[1], 16, dataset.num_classes)
train(g, model)
######################################################################
# Even more customization by user-defined function
# ------------------------------------------------
#
# DGL allows user-defined message and reduce functions for maximal
# expressiveness. Here is a user-defined message function that is
# equivalent to ``fn.u_mul_e('h', 'w', 'm')``.
#
def u_mul_e_udf(edges):
return {'m' : edges.src['h'] * edges.data['w']}
######################################################################
# ``edges`` has three members: ``src``, ``data`` and ``dst``, representing
# the source node feature, edge feature, and destination node feature for
# all edges.
#
######################################################################
# You can also write your own reduce function. For example, the following
# is equivalent to the builtin ``fn.sum('m', 'h')`` function that sums up
# the incoming messages:
#
def sum_udf(nodes):
return {'h': nodes.mailbox['m'].sum(1)}
######################################################################
# In short, DGL will group the nodes by their in-degrees, and for each
# group DGL stacks the incoming messages along the second dimension. You
# can then perform a reduction along the second dimension to aggregate
# messages.
#
# For more details on customizing message and reduce function with
# user-defined function, please refer to the :ref:`API
# reference <apiudf>`.
#
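######################################################################
# As one more sketch (not required by the rest of this tutorial), a
# user-defined reduce function equivalent to the builtin
# ``fn.max('m', 'h')`` reduces the stacked messages with a max along the
# second dimension:
#

def max_udf(nodes):
    # nodes.mailbox['m'] stacks the incoming messages of each degree
    # group along dimension 1; max(1) reduces over that dimension.
    return {'h': nodes.mailbox['m'].max(1)[0]}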
######################################################################
# Best practice of writing custom GNN modules
# -------------------------------------------
#
# DGL recommends the following practices, ranked by preference:
#
# - Use ``dgl.nn`` modules.
# - Use ``dgl.nn.functional`` functions which contain lower-level complex
# operations such as computing a softmax for each node over incoming
# edges.
# - Use ``update_all`` with builtin message and reduce functions.
# - Use user-defined message or reduce functions.
#
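######################################################################
# For example (a sketch of the second option above; the edge scores here
# are random and purely illustrative), ``dgl.nn.functional.edge_softmax``
# normalizes one score per edge over the incoming edges of each
# destination node, which is the core operation of attention modules
# such as GAT:
#

from dgl.nn.functional import edge_softmax

scores = torch.randn(g.num_edges(), 1)   # one raw score per edge
attention = edge_softmax(g, scores)      # normalized over incoming edges
print(attention.shape)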
######################################################################
# What’s next?
# ------------
#
# - :ref:`Writing Efficient Message Passing
# Code <guide-message-passing-efficient>`.
#
"""
Link Prediction using Graph Neural Networks
===========================================
In the :doc:`introduction <1_introduction>`, you have already learned the
basic workflow of using GNNs for node classification, i.e. predicting
the category of a node in a graph. This tutorial will teach you how to
train a GNN for link prediction, i.e. predicting the existence of an
edge between two arbitrary nodes in a graph.
By the end of this tutorial you will be able to
- Build a GNN-based link prediction model.
- Train and evaluate the model on a small DGL-provided dataset.
(Time estimate: 20 minutes)
"""
import dgl
import torch
import torch.nn as nn
import torch.nn.functional as F
import numpy as np
import scipy.sparse as sp
######################################################################
# Overview of Link Prediction with GNN
# ------------------------------------
#
# Many applications such as social recommendation, item recommendation,
# knowledge graph completion, etc., can be formulated as link prediction,
# which predicts whether an edge exists between two particular nodes. This
# tutorial shows an example of predicting whether a citation relationship,
# either citing or being cited, between two papers exists in a citation
# network.
#
# This tutorial follows a relatively simple practice from
# `SEAL <https://papers.nips.cc/paper/2018/file/53f0d7c537d99b3824f0f99d62ea2428-Paper.pdf>`__.
# It formulates the link prediction problem as a binary classification
# problem as follows:
#
# - Treat the edges in the graph as *positive examples*.
# - Sample a number of non-existent edges (i.e. node pairs with no edges
# between them) as *negative* examples.
# - Divide the positive examples and negative examples into a training
# set and a test set.
# - Evaluate the model with any binary classification metric such as Area
# Under Curve (AUC).
#
# In some domains such as large-scale recommender systems or information
# retrieval, you may favor metrics that emphasize good performance of
# top-K predictions. In these cases you may want to consider other metrics
# such as mean average precision, and use other negative sampling methods,
# which are beyond the scope of this tutorial.
#
# Loading graph and features
# --------------------------
#
# Following the :doc:`introduction <1_introduction>`, we first load the
# Cora dataset.
#
import dgl.data
dataset = dgl.data.CoraGraphDataset()
g = dataset[0]
######################################################################
# Preparing training and testing sets
# -----------------------------------
#
# This tutorial randomly picks 10% of the edges for positive examples in
# the test set, and leaves the rest for the training set. It then samples
# the same number of edges for negative examples in both sets.
#
# Split edge set for training and testing
u, v = g.edges()
eids = np.arange(g.number_of_edges())
eids = np.random.permutation(eids)
test_size = int(len(eids) * 0.1)
train_size = g.number_of_edges() - test_size
test_pos_u, test_pos_v = u[eids[:test_size]], v[eids[:test_size]]
train_pos_u, train_pos_v = u[eids[test_size:]], v[eids[test_size:]]
# Find all negative edges and split them for training and testing
adj = sp.coo_matrix((np.ones(len(u)), (u.numpy(), v.numpy())),
                    shape=(g.number_of_nodes(), g.number_of_nodes()))
adj_neg = 1 - adj.todense() - np.eye(g.number_of_nodes())
neg_u, neg_v = np.where(adj_neg != 0)
# Sample as many negative edges as there are positive edges so that the
# training and test sets stay balanced.
neg_eids = np.random.choice(len(neg_u), g.number_of_edges())
test_neg_u, test_neg_v = neg_u[neg_eids[:test_size]], neg_v[neg_eids[:test_size]]
train_neg_u, train_neg_v = neg_u[neg_eids[test_size:]], neg_v[neg_eids[test_size:]]
# Create the training set. Positive examples are labeled 1 and negative
# examples are labeled 0.
train_u = torch.cat([torch.as_tensor(train_pos_u), torch.as_tensor(train_neg_u)])
train_v = torch.cat([torch.as_tensor(train_pos_v), torch.as_tensor(train_neg_v)])
train_label = torch.cat([torch.ones(len(train_pos_u)), torch.zeros(len(train_neg_u))])
# Create the testing set.
test_u = torch.cat([torch.as_tensor(test_pos_u), torch.as_tensor(test_neg_u)])
test_v = torch.cat([torch.as_tensor(test_pos_v), torch.as_tensor(test_neg_v)])
test_label = torch.cat([torch.ones(len(test_pos_u)), torch.zeros(len(test_neg_u))])
######################################################################
# When training, you will need to remove the edges in the test set from
# the original graph. You can do this via ``dgl.remove_edges``.
#
# .. note::
#
# ``dgl.remove_edges`` works by creating a subgraph from the original
# graph, resulting in a copy, and could therefore be slow for large
# graphs. If so, you can save the training and test graphs to
# disk, as you would do for other preprocessing steps.
#
train_g = dgl.remove_edges(g, eids[:test_size])
######################################################################
# Defining a GraphSAGE model
# --------------------------
#
# This tutorial builds a model consisting of two
# `GraphSAGE <https://arxiv.org/abs/1706.02216>`__ layers, each of which computes
# new node representations by averaging neighbor information. DGL provides
# ``dgl.nn.SAGEConv`` that conveniently creates a GraphSAGE layer.
#
from dgl.nn import SAGEConv
# ----------- 2. create model -------------- #
# build a two-layer GraphSAGE model
class GraphSAGE(nn.Module):
def __init__(self, in_feats, h_feats):
super(GraphSAGE, self).__init__()
self.conv1 = SAGEConv(in_feats, h_feats, 'mean')
self.conv2 = SAGEConv(h_feats, h_feats, 'mean')
def forward(self, g, in_feat):
h = self.conv1(g, in_feat)
h = F.relu(h)
h = self.conv2(g, h)
return h
model = GraphSAGE(train_g.ndata['feat'].shape[1], 16)
######################################################################
# The model then predicts the probability of existence of an edge by
# computing a dot product between the representations of both incident
# nodes.
#
# .. math::
#
#
# \hat{y}_{u\sim v} = \sigma(h_u^T h_v)
#
# The loss function is simply binary cross entropy loss.
#
# .. math::
#
#
# \mathcal{L} = -\sum_{u\sim v\in \mathcal{D}}\left( y_{u\sim v}\log(\hat{y}_{u\sim v}) + (1-y_{u\sim v})\log(1-\hat{y}_{u\sim v})\right)
#
# .. note::
#
# This tutorial does not include evaluation on a validation
# set. In practice you should save and evaluate the best model based on
# performance on the validation set.
#
# ----------- 3. set up loss and optimizer -------------- #
# in this case, the loss is computed inside the training loop
optimizer = torch.optim.Adam(model.parameters(), lr=0.01)
# ----------- 4. training -------------------------------- #
for e in range(100):
# forward
logits = model(train_g, train_g.ndata['feat'])
pred = torch.sigmoid((logits[train_u] * logits[train_v]).sum(dim=1))
# compute loss
loss = F.binary_cross_entropy(pred, train_label)
# backward
optimizer.zero_grad()
loss.backward()
optimizer.step()
if e % 5 == 0:
print('In epoch {}, loss: {}'.format(e, loss))
# ----------- 5. check results ------------------------ #
from sklearn.metrics import roc_auc_score
with torch.no_grad():
pred = torch.sigmoid((logits[test_u] * logits[test_v]).sum(dim=1))
pred = pred.numpy()
label = test_label.numpy()
print('AUC', roc_auc_score(label, pred))
"""
Training a GNN for Graph Classification
=======================================
By the end of this tutorial, you will be able to
- Load a DGL-provided graph classification dataset.
- Understand what *readout* function does.
- Understand how to create and use a minibatch of graphs.
- Build a GNN-based graph classification model.
- Train and evaluate the model on a DGL-provided dataset.
(Time estimate: 18 minutes)
"""
import dgl
import torch
import torch.nn as nn
import torch.nn.functional as F
######################################################################
# Overview of Graph Classification with GNN
# -----------------------------------------
#
# Graph classification or regression requires a model to predict certain
# graph-level properties of a single graph given its node and edge
# features. Molecular property prediction is one particular application.
#
# This tutorial shows how to train a graph classification model for a
# small dataset from the paper `How Powerful Are Graph Neural
# Networks <https://arxiv.org/abs/1810.00826>`__.
#
# Loading Data
# ------------
#
import dgl.data
# Load the PROTEINS dataset, a collection of protein graphs used in the paper above.
dataset = dgl.data.GINDataset('PROTEINS', self_loop=True)
######################################################################
# The dataset is a set of graphs, each with node features and a single
# label. One can see the node feature dimensionality and the number of
# possible graph categories of a ``GINDataset`` object in its ``dim_nfeats``
# and ``gclasses`` attributes.
#
print('Node feature dimensionality:', dataset.dim_nfeats)
print('Number of graph categories:', dataset.gclasses)
######################################################################
# Defining Data Loader
# --------------------
#
# A graph classification dataset usually contains two types of elements: a
# set of graphs, and their graph-level labels. Similar to an image
# classification task, when the dataset is large enough, we need to train
# with mini-batches. When you train a model for image classification or
# language modeling, you will use a ``DataLoader`` to iterate over the
# dataset. In DGL, you can use the ``GraphDataLoader``.
#
# You can also use various dataset samplers provided in
# `torch.utils.data.sampler <https://pytorch.org/docs/stable/data.html#data-loading-order-and-sampler>`__.
# For example, this tutorial creates a training ``GraphDataLoader`` and
# test ``GraphDataLoader``, using ``SubsetRandomSampler`` to tell PyTorch
# to sample from only a subset of the dataset.
#
from dgl.dataloading import GraphDataLoader
from torch.utils.data.sampler import SubsetRandomSampler
num_examples = len(dataset)
num_train = int(num_examples * 0.8)
train_sampler = SubsetRandomSampler(torch.arange(num_train))
test_sampler = SubsetRandomSampler(torch.arange(num_train, num_examples))
train_dataloader = GraphDataLoader(
dataset, sampler=train_sampler, batch_size=5, drop_last=False)
test_dataloader = GraphDataLoader(
dataset, sampler=test_sampler, batch_size=5, drop_last=False)
######################################################################
# You can try to iterate over the created ``GraphDataLoader`` and see what it
# gives:
#
it = iter(train_dataloader)
batch = next(it)
print(batch)
######################################################################
# As each element in ``dataset`` has a graph and a label, the
# ``GraphDataLoader`` will return two objects for each iteration. The
# first element is the batched graph, and the second element is simply a
# label vector representing the category of each graph in the mini-batch.
# Next, we’ll talk about the batched graph.
#
# A Batched Graph in DGL
# ----------------------
#
# In each mini-batch, the sampled graphs are combined into a single bigger
# batched graph via ``dgl.batch``. The single bigger batched graph merges
# all original graphs as separate connected components, with the node
# and edge features concatenated. This bigger graph is also a ``DGLGraph``
# instance (so you can still treat it as a normal ``DGLGraph`` object as
# in :doc:`the DGLGraph tutorial <2_dglgraph>`). It however contains the information
# necessary for recovering the original graphs, such as the number of
# nodes and edges of each graph element.
#
batched_graph, labels = batch
print('Number of nodes for each graph element in the batch:', batched_graph.batch_num_nodes())
print('Number of edges for each graph element in the batch:', batched_graph.batch_num_edges())
# Recover the original graph elements from the minibatch
graphs = dgl.unbatch(batched_graph)
print('The original graphs in the minibatch:')
print(graphs)
######################################################################
# Define Model
# ------------
#
# This tutorial will build a two-layer `Graph Convolutional Network
# (GCN) <http://tkipf.github.io/graph-convolutional-networks/>`__. Each of
# its layers computes new node representations by aggregating neighbor
# information. If you have gone through the
# :doc:`introduction <1_introduction>`, you will notice two
# differences:
#
# - Since the task is to predict a single category for the *entire graph*
# instead of for every node, you will need to aggregate the
# representations of all the nodes and potentially the edges to form a
# graph-level representation. Such a process is more commonly referred to as
# a *readout*. A simple choice is to average the node features of a
# graph with ``dgl.mean_nodes()``.
#
# - The input graph to the model will be a batched graph yielded by the
# ``GraphDataLoader``. The readout functions provided by DGL can handle
# batched graphs so that they will return one representation for each
# minibatch element.
#
from dgl.nn import GraphConv
class GCN(nn.Module):
def __init__(self, in_feats, h_feats, num_classes):
super(GCN, self).__init__()
self.conv1 = GraphConv(in_feats, h_feats)
self.conv2 = GraphConv(h_feats, num_classes)
def forward(self, g, in_feat):
h = self.conv1(g, in_feat)
h = F.relu(h)
h = self.conv2(g, h)
g.ndata['h'] = h
return dgl.mean_nodes(g, 'h')
######################################################################
# Training Loop
# -------------
#
# The training loop iterates over the training set with the
# ``GraphDataLoader`` object and computes the gradients, just like
# image classification or language modeling.
#
# Create the model with given dimensions
model = GCN(dataset.dim_nfeats, 16, dataset.gclasses)
optimizer = torch.optim.Adam(model.parameters(), lr=0.01)
for epoch in range(20):
for batched_graph, labels in train_dataloader:
pred = model(batched_graph, batched_graph.ndata['attr'].float())
loss = F.cross_entropy(pred, labels)
optimizer.zero_grad()
loss.backward()
optimizer.step()
num_correct = 0
num_tests = 0
for batched_graph, labels in test_dataloader:
pred = model(batched_graph, batched_graph.ndata['attr'].float())
num_correct += (pred.argmax(1) == labels).sum().item()
num_tests += len(labels)
print('Test accuracy:', num_correct / num_tests)
######################################################################
# What’s next
# -----------
#
# - See `GIN
# example <https://github.com/dmlc/dgl/tree/master/examples/pytorch/gin>`__
# for an end-to-end graph classification model.
#
"""
Make Your Own Dataset
=====================
This tutorial assumes that you already know :doc:`the basics of training a
GNN for node classification <1_introduction>` and :doc:`how to
create, load, and store a DGL graph <2_dglgraph>`.
By the end of this tutorial, you will be able to
- Create your own graph dataset for node classification, link
prediction, or graph classification.
(Time estimate: 15 minutes)
"""
######################################################################
# ``DGLDataset`` Object Overview
# ------------------------------
#
# Your custom graph dataset should inherit the ``dgl.data.DGLDataset``
# class and implement the following methods:
#
# - ``__getitem__(self, i)``: retrieve the ``i``-th example of the
# dataset. An example often contains a single DGL graph, and
# occasionally its label.
# - ``__len__(self)``: the number of examples in the dataset.
# - ``process(self)``: load and process raw data from disk.
#
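######################################################################
# A minimal skeleton looks like the following (a sketch only; the class
# name and the in-memory lists are made up for illustration):
#

from dgl.data import DGLDataset

class MyDataset(DGLDataset):
    def __init__(self):
        super().__init__(name='my_dataset')

    def process(self):
        # Load raw data from disk and build the graphs and labels here.
        self.graphs = []
        self.labels = []

    def __getitem__(self, i):
        return self.graphs[i], self.labels[i]

    def __len__(self):
        return len(self.graphs)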
######################################################################
# Creating a Dataset for Node Classification or Link Prediction from CSV
# ----------------------------------------------------------------------
#
# A node classification dataset often consists of a single graph, as well
# as its node and edge features.
#
# This tutorial takes a small dataset based on `Zachary’s Karate Club
# network <https://en.wikipedia.org/wiki/Zachary%27s_karate_club>`__. It
# contains
#
# * A ``members.csv`` file containing the attributes of all
# club members.
#
# * An ``interactions.csv`` file
# containing the pair-wise interactions between two club members.
#
import urllib.request
import pandas as pd
urllib.request.urlretrieve(
'https://data.dgl.ai/tutorial/dataset/members.csv', './members.csv')
urllib.request.urlretrieve(
'https://data.dgl.ai/tutorial/dataset/interactions.csv', './interactions.csv')
members = pd.read_csv('./members.csv')
members.head()
interactions = pd.read_csv('./interactions.csv')
interactions.head()
######################################################################
# This tutorial treats the members as nodes and interactions as edges. It
# takes age as a numeric feature of the nodes, affiliated club as the label
# of the nodes, and edge weight as a numeric feature of the edges.
#
# .. note::
#
# The original Zachary’s Karate Club network does not have
# member ages. The ages in this tutorial are generated synthetically
# for demonstrating how to add node features into the graph for dataset
# creation.
#
# .. note::
#
# In practice, taking age directly as a numeric feature may
# not work well in machine learning; strategies like binning or
# normalizing the feature would work better. This tutorial directly
# takes the values as-is for simplicity.
#
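######################################################################
# As a quick sketch of such a strategy (not used in the rest of this
# tutorial), one could standardize the age column before storing it as a
# feature:
#

import torch

ages = torch.from_numpy(members['Age'].to_numpy()).float()
normalized_ages = (ages - ages.mean()) / ages.std()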
import dgl
from dgl.data import DGLDataset
import torch
import os
class KarateClubDataset(DGLDataset):
def __init__(self):
super().__init__(name='karate_club')
def process(self):
nodes_data = pd.read_csv('./members.csv')
edges_data = pd.read_csv('./interactions.csv')
node_features = torch.from_numpy(nodes_data['Age'].to_numpy())
node_labels = torch.from_numpy(nodes_data['Club'].astype('category').cat.codes.to_numpy())
edge_features = torch.from_numpy(edges_data['Weight'].to_numpy())
edges_src = torch.from_numpy(edges_data['Src'].to_numpy())
edges_dst = torch.from_numpy(edges_data['Dst'].to_numpy())
self.graph = dgl.graph((edges_src, edges_dst), num_nodes=nodes_data.shape[0])
self.graph.ndata['feat'] = node_features
self.graph.ndata['label'] = node_labels
self.graph.edata['weight'] = edge_features
# If your dataset is a node classification dataset, you will need to assign
# masks indicating whether a node belongs to training, validation, and test set.
n_nodes = nodes_data.shape[0]
n_train = int(n_nodes * 0.6)
n_val = int(n_nodes * 0.2)
train_mask = torch.zeros(n_nodes, dtype=torch.bool)
val_mask = torch.zeros(n_nodes, dtype=torch.bool)
test_mask = torch.zeros(n_nodes, dtype=torch.bool)
train_mask[:n_train] = True
val_mask[n_train:n_train + n_val] = True
test_mask[n_train + n_val:] = True
self.graph.ndata['train_mask'] = train_mask
self.graph.ndata['val_mask'] = val_mask
self.graph.ndata['test_mask'] = test_mask
def __getitem__(self, i):
return self.graph
def __len__(self):
return 1
dataset = KarateClubDataset()
graph = dataset[0]
print(graph)
######################################################################
# Since a link prediction dataset only involves a single graph, preparing
# a link prediction dataset follows the same process as preparing a
# node classification dataset.
#
######################################################################
# Creating a Dataset for Graph Classification from CSV
# ----------------------------------------------------
#
# Creating a graph classification dataset involves implementing
# ``__getitem__`` to return both the graph and its graph-level label.
#
# This tutorial demonstrates how to create a graph classification dataset
# with the following synthetic CSV data:
#
# - ``graph_edges.csv``: containing three columns:
#
# - ``graph_id``: the ID of the graph.
# - ``src``: the source node of an edge of the given graph.
# - ``dst``: the destination node of an edge of the given graph.
#
# - ``graph_properties.csv``: containing three columns:
#
# - ``graph_id``: the ID of the graph.
# - ``label``: the label of the graph.
# - ``num_nodes``: the number of nodes in the graph.
#
urllib.request.urlretrieve(
'https://data.dgl.ai/tutorial/dataset/graph_edges.csv', './graph_edges.csv')
urllib.request.urlretrieve(
'https://data.dgl.ai/tutorial/dataset/graph_properties.csv', './graph_properties.csv')
edges = pd.read_csv('./graph_edges.csv')
properties = pd.read_csv('./graph_properties.csv')
edges.head()
properties.head()
class SyntheticDataset(DGLDataset):
def __init__(self):
super().__init__(name='synthetic')
def process(self):
edges = pd.read_csv('./graph_edges.csv')
properties = pd.read_csv('./graph_properties.csv')
self.graphs = []
self.labels = []
# Create a graph for each graph ID from the edges table.
# First process the properties table into two dictionaries with graph IDs as keys.
# The label and number of nodes are values.
label_dict = {}
num_nodes_dict = {}
for _, row in properties.iterrows():
label_dict[row['graph_id']] = row['label']
num_nodes_dict[row['graph_id']] = row['num_nodes']
# For the edges, first group the table by graph IDs.
edges_group = edges.groupby('graph_id')
# For each graph ID...
for graph_id in edges_group.groups:
# Find the edges as well as the number of nodes and its label.
edges_of_id = edges_group.get_group(graph_id)
src = edges_of_id['src'].to_numpy()
dst = edges_of_id['dst'].to_numpy()
num_nodes = num_nodes_dict[graph_id]
label = label_dict[graph_id]
# Create a graph and add it to the list of graphs and labels.
g = dgl.graph((src, dst), num_nodes=num_nodes)
self.graphs.append(g)
self.labels.append(label)
# Convert the label list to tensor for saving.
self.labels = torch.LongTensor(self.labels)
def __getitem__(self, i):
return self.graphs[i], self.labels[i]
def __len__(self):
return len(self.graphs)
dataset = SyntheticDataset()
graph, label = dataset[0]
print(graph, label)