# DGL-Go


DGL-Go is a command line tool that helps users get started with training, using, and
studying Graph Neural Networks (GNNs). Data scientists can quickly apply GNNs
to their problems, while researchers will find it useful for customizing their
experiments.


## Installation and getting started

DGL-Go requires DGL v0.8+, so please make sure your DGL installation is up to date.
Install DGL-Go with `pip install dglgo` and type `dgl` in your console:
```
Usage: dgl [OPTIONS] COMMAND [ARGS]...

Options:
  --help  Show this message and exit.

Commands:
  configure  Generate a configuration file
  export     Export a runnable python script
  recipe     Get example recipes
  train      Launch training
```

<p align="center">
  <img src="./dglgo.png" height="400">
</p>

Using DGL-Go is as easy as three steps:

1. Use `dgl configure` to pick the task, dataset, and model of your interest. It generates
   a configuration file for later use. You can also use `dgl recipe get` to retrieve
   one of the configuration files we provide.
1. Use `dgl train` to launch training according to the configuration and see the results.
1. Use `dgl export` to generate a *self-contained, reproducible* Python script for advanced
   customization, or try the model on custom data stored in CSV format.

Next, we will walk through all these steps one-by-one.

## Training GraphSAGE for node classification on Cora

Let's use one of the most classic setups as an example -- training a GraphSAGE model for node
classification on the Cora citation graph dataset.

### Step 1: `dgl configure`

First, use `dgl configure` to generate a YAML configuration file.

```
dgl configure nodepred --data cora --model sage --cfg cora_sage.yaml
```

Note that `nodepred` is the name of a DGL-Go *pipeline*. For now, you can think of
a pipeline as a training task: `nodepred` is for the node prediction task; other
options include `linkpred` for the link prediction task, etc. The command will
generate a configuration file `cora_sage.yaml` which includes:

* Options for the selected dataset (i.e., `cora` here).
* Model hyperparameters (e.g., number of layers, hidden size, etc.).
* Training hyperparameters (e.g., learning rate, loss function, etc.).

Different choices of task, model and dataset may give very different options,
so DGL-Go also adds a comment in the file explaining what each option does.
At this point you can also change options to explore tuning opportunities.

Below is the configuration file generated by the command above.

```yaml
version: 0.0.1
pipeline_name: nodepred
device: cpu
data:
  name: cora
  split_ratio:                # Ratio to generate split masks, for example set to [0.8, 0.1, 0.1] for 80% train/10% val/10% test. Leave blank to use builtin split in original dataset
model:
  name: sage
  embed_size: -1              # The dimension of created embedding table. -1 means using original node embedding
  hidden_size: 16             # Hidden size.
  num_layers: 1               # Number of hidden layers.
  activation: relu            # Activation function name under torch.nn.functional
  dropout: 0.5                # Dropout rate.
  aggregator_type: gcn        # Aggregator type to use (``mean``, ``gcn``, ``pool``, ``lstm``).
general_pipeline:
  early_stop:
    patience: 20              # Steps before early stop
    checkpoint_path: checkpoint.pth # Early stop checkpoint model file path
  num_epochs: 200             # Number of training epochs
  eval_period: 5              # Interval epochs between evaluations
  optimizer:
    name: Adam
    lr: 0.01
    weight_decay: 0.0005
  loss: CrossEntropyLoss
  save_path: model.pth        # Path to save the model
  num_runs: 1                 # Number of experiments to run
```
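
Since the file is plain YAML, you can also tweak options programmatically instead of editing
it by hand. Below is a minimal sketch (assuming PyYAML is installed; the keys follow the example
above) that widens the hidden layer and lowers the learning rate before retraining:

```python
# Minimal sketch: edit the generated configuration programmatically (assumes PyYAML).
import yaml

with open("cora_sage.yaml") as f:
    cfg = yaml.safe_load(f)

cfg["model"]["hidden_size"] = 32                    # widen the hidden layer
cfg["general_pipeline"]["optimizer"]["lr"] = 0.005  # lower the learning rate

with open("cora_sage.yaml", "w") as f:
    yaml.safe_dump(cfg, f)

# Then rerun: dgl train --cfg cora_sage.yaml
```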

Apart from `dgl configure`, you could also get one of DGL-Go's built-in configuration files
(called *recipes*) using `dgl recipe`. There are two sub-commands:

```
dgl recipe list
```

will list the available recipes:

```
➜ dgl recipe list     
===============================================================================
| Filename                       |  Pipeline           | Dataset              |
===============================================================================
| linkpred_citation2_sage.yaml   |  linkpred           | ogbl-citation2       |
| linkpred_collab_sage.yaml      |  linkpred           | ogbl-collab          |
| nodepred_citeseer_sage.yaml    |  nodepred           | citeseer             |
| nodepred_citeseer_gcn.yaml     |  nodepred           | citeseer             |
| nodepred-ns_arxiv_gcn.yaml     |  nodepred-ns        | ogbn-arxiv           |
| nodepred_cora_gat.yaml         |  nodepred           | cora                 |
| nodepred_pubmed_sage.yaml      |  nodepred           | pubmed               |
| linkpred_cora_sage.yaml        |  linkpred           | cora                 |
| nodepred_pubmed_gcn.yaml       |  nodepred           | pubmed               |
| nodepred_pubmed_gat.yaml       |  nodepred           | pubmed               |
| nodepred_cora_gcn.yaml         |  nodepred           | cora                 |
| nodepred_cora_sage.yaml        |  nodepred           | cora                 |
| nodepred_citeseer_gat.yaml     |  nodepred           | citeseer             |
| nodepred-ns_product_sage.yaml  |  nodepred-ns        | ogbn-products        |
===============================================================================
```

Then use

```
dgl recipe get nodepred_cora_sage.yaml
```

to copy the YAML configuration file to your local folder.

### Step 2: `dgl train`

Simply running `dgl train --cfg cora_sage.yaml` will start the training process.
```log
...
Epoch 00190 | Loss 1.5225 | TrainAcc 0.9500 | ValAcc 0.6840
Epoch 00191 | Loss 1.5416 | TrainAcc 0.9357 | ValAcc 0.6840
Epoch 00192 | Loss 1.5391 | TrainAcc 0.9357 | ValAcc 0.6840
Epoch 00193 | Loss 1.5257 | TrainAcc 0.9643 | ValAcc 0.6840
Epoch 00194 | Loss 1.5196 | TrainAcc 0.9286 | ValAcc 0.6840
EarlyStopping counter: 12 out of 20
Epoch 00195 | Loss 1.4862 | TrainAcc 0.9643 | ValAcc 0.6760
Epoch 00196 | Loss 1.5142 | TrainAcc 0.9714 | ValAcc 0.6760
Epoch 00197 | Loss 1.5145 | TrainAcc 0.9714 | ValAcc 0.6760
Epoch 00198 | Loss 1.5174 | TrainAcc 0.9571 | ValAcc 0.6760
Epoch 00199 | Loss 1.5235 | TrainAcc 0.9714 | ValAcc 0.6760
Test Accuracy 0.7740
Accuracy across 1 runs: 0.774 ± 0.0
```

That's all! Basically you only need two commands to train a graph neural network.

### Step 3: `dgl export` for more advanced customization

That's not everything yet. You may want to open the hood and perform deeper
customization. DGL-Go can export a **self-contained, reproducible** Python
script for you to do anything you like.

Try `dgl export --cfg cora_sage.yaml --output script.py`,
and you'll get the script used to train the model. Here's the code snippet:

```python
...

class GraphSAGE(nn.Module):
    def __init__(self,
                 data_info: dict,
                 embed_size: int = -1,
                 hidden_size: int = 16,
                 num_layers: int = 1,
                 activation: str = "relu",
                 dropout: float = 0.5,
                 aggregator_type: str = "gcn"):
        """GraphSAGE model

        Parameters
        ----------
        data_info : dict
            The information about the input dataset.
        embed_size : int
            The dimension of created embedding table. -1 means using original node embedding
        hidden_size : int
            Hidden size.
        num_layers : int
            Number of hidden layers.
        dropout : float
            Dropout rate.
        activation : str
            Activation function name under torch.nn.functional
        aggregator_type : str
            Aggregator type to use (``mean``, ``gcn``, ``pool``, ``lstm``).
        """
        super(GraphSAGE, self).__init__()
        self.data_info = data_info
        self.embed_size = embed_size
        if embed_size > 0:
            self.embed = nn.Embedding(data_info["num_nodes"], embed_size)
            in_size = embed_size
        else:
            in_size = data_info["in_size"]
        self.layers = nn.ModuleList()
        self.dropout = nn.Dropout(dropout)
        self.activation = getattr(nn.functional, activation)

        for i in range(num_layers):
            in_hidden = hidden_size if i > 0 else in_size
            out_hidden = hidden_size if i < num_layers - 1 else data_info["out_size"]
            self.layers.append(dgl.nn.SAGEConv(in_hidden, out_hidden, aggregator_type))

    def forward(self, graph, node_feat, edge_feat=None):
        if self.embed_size > 0:
            dgl_warning(
                "The embedding for node feature is used, and input node_feat is ignored, due to the provided embed_size.",
                norepeat=True)
            h = self.embed.weight
        else:
            h = node_feat
        h = self.dropout(h)
        for l, layer in enumerate(self.layers):
            h = layer(graph, h, edge_feat)
            if l != len(self.layers) - 1:
                h = self.activation(h)
                h = self.dropout(h)
        return h

...

def train(cfg, pipeline_cfg, device, data, model, optimizer, loss_fcn):
    g = data[0]  # Only train on the first graph
    g = dgl.remove_self_loop(g)
    g = dgl.add_self_loop(g)
    g = g.to(device)

    node_feat = g.ndata.get('feat', None)
    edge_feat = g.edata.get('feat', None)
    label = g.ndata['label']
    train_mask = g.ndata['train_mask'].bool()
    val_mask = g.ndata['val_mask'].bool()
    test_mask = g.ndata['test_mask'].bool()

    stopper = EarlyStopping(**pipeline_cfg['early_stop'])

    val_acc = 0.
    for epoch in range(pipeline_cfg['num_epochs']):
        model.train()
        logits = model(g, node_feat, edge_feat)
        loss = loss_fcn(logits[train_mask], label[train_mask])

        optimizer.zero_grad()
        loss.backward()
        optimizer.step()

        train_acc = accuracy(logits[train_mask], label[train_mask])
        if epoch != 0 and epoch % pipeline_cfg['eval_period'] == 0:
            val_acc = accuracy(logits[val_mask], label[val_mask])

            if stopper.step(val_acc, model):
                break

        print("Epoch {:05d} | Loss {:.4f} | TrainAcc {:.4f} | ValAcc {:.4f}".
              format(epoch, loss.item(), train_acc, val_acc))

    stopper.load_checkpoint(model)

    model.eval()
    with torch.no_grad():
        logits = model(g, node_feat, edge_feat)
        test_acc = accuracy(logits[test_mask], label[test_mask])
    return test_acc


def main():
    cfg = {
        'version': '0.0.1',
        'device': 'cuda:0',
        'model': {
            'embed_size': -1,
            'hidden_size': 16,
            'num_layers': 2,
            'activation': 'relu',
            'dropout': 0.5,
            'aggregator_type': 'gcn'},
        'general_pipeline': {
            'early_stop': {
                'patience': 100,
                'checkpoint_path': 'checkpoint.pth'},
            'num_epochs': 200,
            'eval_period': 5,
            'optimizer': {
                'lr': 0.01,
                'weight_decay': 0.0005},
            'loss': 'CrossEntropyLoss',
            'save_path': 'model.pth',
            'num_runs': 10}}
    device = cfg['device']
    pipeline_cfg = cfg['general_pipeline']
    # load data
    data = AsNodePredDataset(CoraGraphDataset())
    # create model
    model_cfg = cfg["model"]
    cfg["model"]["data_info"] = {
        "in_size": model_cfg['embed_size'] if model_cfg['embed_size'] > 0 else data[0].ndata['feat'].shape[1],
        "out_size": data.num_classes,
        "num_nodes": data[0].num_nodes()
    }
    model = GraphSAGE(**cfg["model"])
    model = model.to(device)
    loss = torch.nn.CrossEntropyLoss()
    optimizer = torch.optim.Adam(
        model.parameters(),
        **pipeline_cfg["optimizer"])
    # train
    test_acc = train(cfg, pipeline_cfg, device, data, model, optimizer, loss)
    torch.save(model, pipeline_cfg["save_path"])
    return test_acc

...
```

You can see that everything is collected into one Python script which includes the
entire `GraphSAGE` model definition, data processing and the training loop. Simply running
`python script.py` will give you the *exact same* result as you saw with `dgl train`.
At this point, you can change any part as you wish, such as plugging in your own GNN module,
changing the loss function and so on.
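
For instance, to plug in your own GNN module you can replace the `GraphSAGE` class in the
exported script with any `nn.Module` that keeps the same constructor and
`forward(graph, node_feat, edge_feat)` interface. Below is an illustrative sketch using DGL's
built-in `GraphConv` layer; the `data_info` keys follow the exported script above, while the
class itself is just an example, not part of DGL-Go:

```python
# Illustrative sketch of a drop-in replacement for the exported GraphSAGE class.
# It keeps the same interface so the rest of script.py can stay unchanged.
import dgl
import torch.nn as nn
import torch.nn.functional as F

class MyGCN(nn.Module):
    def __init__(self, data_info: dict, hidden_size: int = 16, dropout: float = 0.5, **kwargs):
        super().__init__()
        self.dropout = nn.Dropout(dropout)
        # data_info carries "in_size" and "out_size", as in the exported script
        self.conv1 = dgl.nn.GraphConv(data_info["in_size"], hidden_size)
        self.conv2 = dgl.nn.GraphConv(hidden_size, data_info["out_size"])

    def forward(self, graph, node_feat, edge_feat=None):
        h = self.dropout(node_feat)
        h = F.relu(self.conv1(graph, h))
        h = self.dropout(h)
        return self.conv2(graph, h)

# In main(), swap `model = GraphSAGE(**cfg["model"])` for something like
# model = MyGCN(**cfg["model"])  -- extra config keys are absorbed by **kwargs.
```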

## Use DGL-Go on your own dataset

DGL-Go supports training a model on a custom dataset via DGL's `CSVDataset`.

### Step 1: Prepare your CSV and metadata file.

Follow the tutorial at [Loading data from CSV
files](https://docs.dgl.ai/en/latest/guide/data-loadcsv.html#guide-data-pipeline-loadcsv)
to prepare your dataset. Generally, the dataset folder should include:
* At least one CSV file for node data.
* At least one CSV file for edge data.
* A metadata file called `meta.yaml`.

### Step 2: `dgl configure` with `--data csv` option
Run

```
dgl configure nodepred --data csv --model sage --cfg csv_sage.yaml
```

to generate the configuration file. You will see that the file includes a section like
the following:

```yaml
...
data:
  name: csv
  split_ratio:                # Ratio to generate split masks, for example set to [0.8, 0.1, 0.1] for 80% train/10% val/10% test. Leave blank to use builtin split in original dataset
  data_path: ./               # metadata.yaml, nodes.csv, edges.csv should be in this folder
...
```

Fill in the `data_path` option with the path to your dataset folder.

If your dataset does not have any native split for training, validation and test sets,
you can set the split ratio in the `split_ratio` option, which will
generate a random split for you.
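
Under the hood, these options correspond to loading the folder as a `CSVDataset` and wrapping it
for node prediction, much like the exported Cora script wraps `CoraGraphDataset` in
`AsNodePredDataset`. Below is a rough sketch you can use to sanity-check your folder before
training; the folder name is a placeholder, and the `split_ratio` keyword of `AsNodePredDataset`
is our assumption about the adapter's API:

```python
# Rough sanity check that a CSV dataset folder loads; ./my_dataset is a placeholder
# for the folder containing meta.yaml and the node/edge CSV files.
import dgl
from dgl.data import AsNodePredDataset

raw = dgl.data.CSVDataset("./my_dataset")                   # the data_path option
data = AsNodePredDataset(raw, split_ratio=[0.8, 0.1, 0.1])  # the split_ratio option
g = data[0]
print(g)                                 # inspect node/edge counts and feature fields
print("num classes:", data.num_classes)
```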

### Step 3: `train` the model / `export` the script
Then you can do the same as the tutorial above, either train the model by
`dgl train --cfg csv_sage.yaml` or use `dgl export --cfg csv_sage.yaml
--output script.py` to get the training script.

## FAQ

**Q: What are the available options for each command?**
A: You can use `--help` with all commands. For example, use `dgl --help` for the general
help message; use `dgl configure --help` for the configuration options; use
`dgl configure nodepred --help` for the configuration options of the node prediction pipeline.

**Q: What exactly is nodepred/linkpred? How many are they?**
A: They are called DGL-Go pipelines. A pipeline represents the training methodology for
a certain task, so its naming convention is *<task_name>[-<method_name>]*. For example,
`nodepred` trains the selected GNN model for node classification using the full-graph training method,
while `nodepred-ns` trains the model for node classification using neighbor sampling.
The first release included three training pipelines (`nodepred`, `nodepred-ns` and `linkpred`)
but you can expect more will be coming in the future. Use `dgl configure --help` to see
all the available pipelines.

**Q: How to add my model to the official model recipe zoo?**
A: Currently not supported. We will enable this feature soon. Please stay tuned!

**Q: After training a model on some dataset, how can I apply it to another one?**
A: The `save_path` option in the generated configuration file allows you to specify where
to save the model after training. You can then modify the script generated by `dgl export`
to load the model checkpoint and evaluate it on another dataset.
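
A minimal sketch of that pattern is below. It assumes the new dataset has the same input feature
size and number of classes as the training dataset, that the model was saved via
`torch.save(model, ...)` as in the exported script, and that the dataset path is a placeholder:

```python
# Hedged sketch: reuse a model trained by `dgl train` on another dataset.
import dgl
import torch
from dgl.data import AsNodePredDataset

model = torch.load("model.pth")      # the save_path from the configuration file
model.eval()

# ./my_other_dataset is a placeholder; it should provide a test_mask
# (or pass split_ratio to AsNodePredDataset to generate one).
data = AsNodePredDataset(dgl.data.CSVDataset("./my_other_dataset"))
g = dgl.add_self_loop(dgl.remove_self_loop(data[0]))

with torch.no_grad():
    logits = model(g, g.ndata.get("feat"), None)
    mask = g.ndata["test_mask"].bool()
    pred = logits[mask].argmax(dim=1)
    print("accuracy:", (pred == g.ndata["label"][mask]).float().mean().item())
```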