"""
.. currentmodule:: dgl

Working with Heterogeneous Graphs
=================================

**Author**: Quan Gan, `Minjie Wang <https://jermainewang.github.io/>`_, Mufei Li,
George Karypis, Zheng Zhang

In this tutorial, you learn about:

* Examples of heterogeneous graph data and typical applications.

* Creating and manipulating a heterogeneous graph in DGL.

* Implementing `Relational-GCN <https://arxiv.org/abs/1703.06103>`_, a popular GNN model,
  for heterogeneous graph input.

* Training a model to solve a node classification task.

Heterogeneous graphs, or *heterographs* for short, are graphs that contain
different types of nodes and edges. The different types of nodes and edges tend
to have different types of attributes that are designed to capture the
characteristics of each node and edge type. Within the context of
graph neural networks, depending on their complexity, certain node and edge types
might need to be modeled with representations that have a different number of dimensions.

DGL supports graph neural network computations on such heterogeneous graphs, by
using the heterograph class and its associated API.

"""

###############################################################################
# Examples of heterographs
# ------------------------
# Many graph datasets represent relationships among various types of entities.
# This section provides an overview of several graph use-cases that show such relationships
# and can have their data represented as heterographs.
#
# Citation graph 
# ~~~~~~~~~~~~~~~
# The Association for Computing Machinery publishes an `ACM dataset <https://aminer.org/citation>`_ that contains two
# million papers, their authors, publication venues, and the other papers
# that were cited. This information can be represented as a heterogeneous graph.
#
# The following diagram shows several entities in the ACM dataset and the relationships among them 
# (taken from `Shi et al., 2015 <https://arxiv.org/pdf/1511.04854.pdf>`_).
#
# .. figure:: https://data.dgl.ai/tutorial/hetero/acm-example.png
# 
# This graph has three types of entities that correspond to papers, authors, and publication venues.
# It also contains three types of edges that connect the following:
#
# * Authors with papers corresponding to *written-by* relationships
#
# * Papers with publication venues corresponding to *published-in* relationships
#
# * Papers with other papers corresponding to *cited-by* relationships
#
#
# Recommender systems 
# ~~~~~~~~~~~~~~~~~~~~ 
# The datasets used in recommender systems often contain
# interactions between users and items. For example, the data could include the
# ratings that users have provided to movies. Such interactions can be modeled
# as heterographs.
#
# The nodes in these heterographs will have two types, *users* and *movies*. The edges
# will correspond to the user-movie interactions. Furthermore, if an interaction is
# marked with a rating, then each rating value could correspond to a different edge type.
# The following diagram shows an example of user-item interactions as a heterograph.
#
# .. figure:: https://data.dgl.ai/tutorial/hetero/recsys-example.png
#
#
# Knowledge graph 
# ~~~~~~~~~~~~~~~~
# Knowledge graphs are inherently heterogeneous. For example, in
# Wikidata, Barack Obama (item Q76) is an instance of a human, which could be viewed as
# the entity class, whose spouse (item P26) is Michelle Obama (item Q13133) and
# occupation (item P106) is politician (item Q82955). The relationships are shown in the
# following diagram.
#
# .. figure:: https://data.dgl.ai/tutorial/hetero/kg-example.png
#

###############################################################################
# Creating a heterograph in DGL
# -----------------------------
# You can create a heterograph in DGL using the :func:`dgl.heterograph` API.
# The argument to :func:`dgl.heterograph` is a dictionary. The keys are tuples
# in the form of ``(srctype, edgetype, dsttype)`` specifying the relation name
# and the two entity types it connects. Such tuples are called *canonical edge
# types*. The values are data to initialize the graph structures, that is, which
# nodes the edges actually connect.
#
# For instance, the following code creates the user-item interactions heterograph shown earlier.

# Each value of the dictionary is a pair of source and destination arrays.
# Nodes are assigned integer IDs starting from zero. Node IDs of different
# types are counted separately.
import dgl
import numpy as np

ratings = dgl.heterograph(
    {('user', '+1', 'movie') : (np.array([0, 0, 1]), np.array([0, 1, 0])),
     ('user', '-1', 'movie') : (np.array([2]), np.array([1]))})

###############################################################################
# DGL supports creating a graph from a variety of data sources. The following
# code creates the same graph as above.
#
# Creating from scipy matrix
import scipy.sparse as sp
plus1 = sp.coo_matrix(([1, 1, 1], ([0, 0, 1], [0, 1, 0])), shape=(3, 2))
minus1 = sp.coo_matrix(([1], ([2], [1])), shape=(3, 2))
ratings = dgl.heterograph(
    {('user', '+1', 'movie') : plus1,
     ('user', '-1', 'movie') : minus1})

# Creating from networkx graph
import networkx as nx
plus1 = nx.DiGraph()
plus1.add_nodes_from(['u0', 'u1', 'u2'], bipartite=0)
plus1.add_nodes_from(['m0', 'm1'], bipartite=1)
plus1.add_edges_from([('u0', 'm0'), ('u0', 'm1'), ('u1', 'm0')])
# To simplify the example, reuse the minus1 object.
# This also means that you could use different sources of graph data
# for different relationships.
ratings = dgl.heterograph(
    {('user', '+1', 'movie') : plus1,
     ('user', '-1', 'movie') : minus1})

###############################################################################
# Manipulating heterograph
# ------------------------
# You can create a more realistic heterograph using the ACM dataset. To do this, first 
# download the dataset as follows:

import scipy.io
import urllib.request

data_url = 'https://data.dgl.ai/dataset/ACM.mat'
data_file_path = '/tmp/ACM.mat'

urllib.request.urlretrieve(data_url, data_file_path)
data = scipy.io.loadmat(data_file_path)
print(list(data.keys()))

###############################################################################
# The dataset stores node information by their types: ``P`` for paper, ``A``
# for author, ``C`` for conference, ``L`` for subject code, and so on. The relationships
# are stored as SciPy sparse matrices under keys of the form ``XvsY``, where ``X`` and ``Y``
# could be any of the node type codes.
#
# The following code prints out some statistics about the paper-author relationships.

print(type(data['PvsA']))
print('#Papers:', data['PvsA'].shape[0])
print('#Authors:', data['PvsA'].shape[1])
print('#Links:', data['PvsA'].nnz)
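Because each relation is an ordinary SciPy sparse matrix, simple structural statistics fall out of standard sparse operations. Here is the pattern on a toy paper-vs-author matrix (made-up data, so the snippet stays runnable without the download):

```python
import numpy as np
import scipy.sparse as sp

# Toy paper-vs-author matrix: rows are papers, columns are authors,
# a 1 marks authorship (3 papers, 3 authors, 4 authorship links).
pva = sp.coo_matrix((np.ones(4), ([0, 0, 1, 2], [0, 1, 1, 2])), shape=(3, 3))

# Column sums give the number of papers per author.
papers_per_author = np.asarray(pva.sum(axis=0)).ravel()
print(papers_per_author)  # [1. 2. 1.]
```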

###############################################################################
# Converting this SciPy matrix to a heterograph in DGL is straightforward.

pa_g = dgl.heterograph({('paper', 'written-by', 'author') : data['PvsA']})
# equivalent (shorter) API for creating heterograph with two node types:
pa_g = dgl.bipartite(data['PvsA'], 'paper', 'written-by', 'author')

###############################################################################
# You can easily print out the type names and other structural information.

print('Node types:', pa_g.ntypes)
print('Edge types:', pa_g.etypes)
print('Canonical edge types:', pa_g.canonical_etypes)

# Nodes and edges are assigned integer IDs starting from zero and each type has its own counting.
# To distinguish the nodes and edges of different types, specify the type name as the argument.
print(pa_g.number_of_nodes('paper'))
# Canonical edge type name can be shortened to only one edge type name if it is
# uniquely distinguishable.
print(pa_g.number_of_edges(('paper', 'written-by', 'author')))
print(pa_g.number_of_edges('written-by'))
print(pa_g.successors(1, etype='written-by'))  # get the authors that write paper #1

# Type name argument could be omitted whenever the behavior is unambiguous.
print(pa_g.number_of_edges())  # Only one edge type, the edge type argument could be omitted

###############################################################################
# A homogeneous graph is just a special case of a heterograph with only one type
# of node and edge. In this case, all the APIs are exactly the same as in
# :class:`DGLGraph`.

# Paper-citing-paper graph is a homogeneous graph
pp_g = dgl.heterograph({('paper', 'citing', 'paper') : data['PvsP']})
# equivalent (shorter) API for creating homogeneous graph
pp_g = dgl.graph(data['PvsP'], 'paper', 'cite')

# All the ntype and etype arguments could be omitted because the behavior is unambiguous.
print(pp_g.number_of_nodes())
print(pp_g.number_of_edges())
print(pp_g.successors(3))

###############################################################################
# Create a subset of the ACM graph using the paper-author, paper-paper,
# and paper-subject relationships. Also add the reverse relationships
# to prepare for the later sections.

G = dgl.heterograph({
        ('paper', 'written-by', 'author') : data['PvsA'],
        ('author', 'writing', 'paper') : data['PvsA'].transpose(),
        ('paper', 'citing', 'paper') : data['PvsP'],
        ('paper', 'cited', 'paper') : data['PvsP'].transpose(),
        ('paper', 'is-about', 'subject') : data['PvsL'],
        ('subject', 'has', 'paper') : data['PvsL'].transpose(),
    })

print(G)

###############################################################################
# **Metagraph** (or network schema) is a useful summary of a heterograph.
# Serving as a template for a heterograph, it tells how many types of objects
# exist in the network and where the possible links exist.
#
# DGL provides easy access to the metagraph, which could be visualized using
# external tools.

# Draw the metagraph using graphviz.
import pygraphviz as pgv
def plot_graph(nxg):
    ag = pgv.AGraph(strict=False, directed=True)
    for u, v, k in nxg.edges(keys=True):
        ag.add_edge(u, v, label=k)
    ag.layout('dot')
    ag.draw('graph.png')

plot_graph(G.metagraph())

###############################################################################
# Learning tasks associated with heterographs
# -------------------------------------------
# Some of the typical learning tasks that involve heterographs include:
#
# * *Node classification and regression* to predict the class of each node or
#   estimate a value associated with it.
#
# * *Link prediction* to predict if there is an edge of a certain
#   type between a pair of nodes, or predict which other nodes a particular
#   node is connected with (and optionally the edge types of such connections).
#
# * *Graph classification/regression* to assign an entire
#   heterograph into one of the target classes or to estimate a numerical
#   value associated with it.
#
# In this tutorial, we designed a simple example for the first task.
#
# A semi-supervised node classification example
# ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
# Our goal is to predict the publishing conference of a paper using the ACM
# academic graph we just created. To further simplify the task, we only focus
# on papers published in three conferences: *KDD*, *ICML*, and *VLDB*. All
# the other papers are not labeled, making it a semi-supervised setting.
#
# The following code extracts those papers from the raw dataset and prepares
# the training, validation, and testing split.

import numpy as np
import torch
import torch.nn as nn
import torch.nn.functional as F

pvc = data['PvsC'].tocsr()
# find all papers published in KDD, ICML, VLDB
c_selected = [0, 11, 13]  # KDD, ICML, VLDB
p_selected = pvc[:, c_selected].tocoo()
# generate labels
labels = pvc.indices
labels[labels == 11] = 1
labels[labels == 13] = 2
labels = torch.tensor(labels).long()

# generate train/val/test split
pid = p_selected.row
shuffle = np.random.permutation(pid)
train_idx = torch.tensor(shuffle[0:800]).long()
val_idx = torch.tensor(shuffle[800:900]).long()
test_idx = torch.tensor(shuffle[900:]).long()
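The split above is random; for reproducible experiments, you may want to seed the generator before permuting (a common pattern, not something the original code does). A self-contained sketch with a hypothetical set of 1,000 labeled paper IDs:

```python
import numpy as np
import torch

# Hypothetical: 1000 labeled paper IDs, split 800/100/100 as above.
pid = np.arange(1000)
np.random.seed(42)  # fix the permutation for reproducibility
shuffle = np.random.permutation(pid)
train_idx = torch.tensor(shuffle[0:800]).long()
val_idx = torch.tensor(shuffle[800:900]).long()
test_idx = torch.tensor(shuffle[900:]).long()
print(len(train_idx), len(val_idx), len(test_idx))  # 800 100 100
```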

###############################################################################
# Relational-GCN on heterograph
# ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
# We use `Relational-GCN <https://arxiv.org/abs/1703.06103>`_ to learn the
# representation of nodes in the graph. Its message-passing equation is as
# follows:
#
# .. math::
#
#    h_i^{(l+1)} = \sigma\left(\sum_{r\in \mathcal{R}}
#    \sum_{j\in\mathcal{N}_r(i)}W_r^{(l)}h_j^{(l)}\right)
#
# Breaking down the equation, you see that there are two parts in the
# computation.
#
# (i) Message computation and aggregation within each relation :math:`r`
#
# (ii) Reduction that merges the results from multiple relationships
#
# Following this intuition, perform message passing on a heterograph in
# two steps.
#
# (i) Per-edge-type message passing
#
# (ii) Type-wise reduction

import dgl.function as fn

class HeteroRGCNLayer(nn.Module):
    def __init__(self, in_size, out_size, etypes):
        super(HeteroRGCNLayer, self).__init__()
        # W_r for each relation
        self.weight = nn.ModuleDict({
                name : nn.Linear(in_size, out_size) for name in etypes
            })

    def forward(self, G, feat_dict):
        # The input is a dictionary of node features for each type
        funcs = {}
        for srctype, etype, dsttype in G.canonical_etypes:
            # Compute W_r * h
            Wh = self.weight[etype](feat_dict[srctype])
            # Save it in graph for message passing
            G.nodes[srctype].data['Wh_%s' % etype] = Wh
            # Specify per-relation message passing functions: (message_func, reduce_func).
            # Note that the results are saved to the same destination feature 'h', which
            # hints the type-wise reducer for aggregation.
            funcs[etype] = (fn.copy_u('Wh_%s' % etype, 'm'), fn.mean('m', 'h'))
        # Trigger message passing of multiple types.
        # The first argument is the message passing functions for each relation.
        # The second one is the type-wise reducer; it could be "sum", "max",
        # "min", "mean", or "stack".
        G.multi_update_all(funcs, 'sum')
        # return the updated node feature dictionary
        return {ntype : G.nodes[ntype].data['h'] for ntype in G.ntypes}

###############################################################################
# Create a simple GNN by stacking two ``HeteroRGCNLayer``. Since the
# nodes do not have input features, make their embeddings trainable.

class HeteroRGCN(nn.Module):
    def __init__(self, G, in_size, hidden_size, out_size):
        super(HeteroRGCN, self).__init__()
        # Use trainable node embeddings as featureless inputs.
        embed_dict = {ntype : nn.Parameter(torch.Tensor(G.number_of_nodes(ntype), in_size))
                      for ntype in G.ntypes}
        for key, embed in embed_dict.items():
            nn.init.xavier_uniform_(embed)
        self.embed = nn.ParameterDict(embed_dict)
        # create layers
        self.layer1 = HeteroRGCNLayer(in_size, hidden_size, G.etypes)
        self.layer2 = HeteroRGCNLayer(hidden_size, out_size, G.etypes)

    def forward(self, G):
        h_dict = self.layer1(G, self.embed)
        h_dict = {k : F.leaky_relu(h) for k, h in h_dict.items()}
        h_dict = self.layer2(G, h_dict)
        # get paper logits
        return h_dict['paper']

###############################################################################
# Train and evaluate
# ~~~~~~~~~~~~~~~~~~
# Train and evaluate this network.

# Create the model. The output has three logits for three classes.
model = HeteroRGCN(G, 10, 10, 3)

opt = torch.optim.Adam(model.parameters(), lr=0.01, weight_decay=5e-4)

best_val_acc = 0
best_test_acc = 0

for epoch in range(100):
    logits = model(G)
    # The loss is computed only for labeled nodes.
    loss = F.cross_entropy(logits[train_idx], labels[train_idx])

    pred = logits.argmax(1)
    train_acc = (pred[train_idx] == labels[train_idx]).float().mean()
    val_acc = (pred[val_idx] == labels[val_idx]).float().mean()
    test_acc = (pred[test_idx] == labels[test_idx]).float().mean()

    if best_val_acc < val_acc:
        best_val_acc = val_acc
        best_test_acc = test_acc

    opt.zero_grad()
    loss.backward()
    opt.step()

    if epoch % 5 == 0:
        print('Loss %.4f, Train Acc %.4f, Val Acc %.4f (Best %.4f), Test Acc %.4f (Best %.4f)' % (
            loss.item(),
            train_acc.item(),
            val_acc.item(),
            best_val_acc.item(),
            test_acc.item(),
            best_test_acc.item(),
        ))

###############################################################################
# What's next?
# ------------
# * Check out our full implementation in PyTorch
#   `here <https://github.com/dmlc/dgl/tree/master/examples/pytorch/rgcn-hetero>`_.
#
# * We also provide the following model examples:
#
#   * `Graph Convolutional Matrix Completion <https://arxiv.org/abs/1706.02263>`_,
#     which we implement in MXNet
#     `here <https://github.com/dmlc/dgl/tree/v0.4.0/examples/mxnet/gcmc>`_.
#
#   * `Heterogeneous Graph Attention Network <https://arxiv.org/abs/1903.07293>`_
#     requires transforming a heterograph into a homogeneous graph according to
#     a given metapath (i.e. a path template consisting of edge types).  We
#     provide :func:`dgl.transform.metapath_reachable_graph` to do this.  See full
#     implementation
#     `here <https://github.com/dmlc/dgl/tree/master/examples/pytorch/han>`_.
#
#   * `Metapath2vec <https://dl.acm.org/citation.cfm?id=3098036>`_ requires
#     generating random walk paths according to a given metapath.  Please
#     refer to the full metapath2vec implementation
#     `here <https://github.com/dmlc/dgl/tree/master/examples/pytorch/metapath2vec>`_.
#
# * :doc:`Full heterograph API reference <../../api/python/heterograph>`.