{
  "cells": [
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "FnFhPMaAfLtJ"
      },
      "source": [
        "# OnDiskDataset for Heterogeneous Graph\n",
        "\n",
        "[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/dmlc/dgl/blob/master/notebooks/stochastic_training/ondisk_dataset_heterograph.ipynb) [![GitHub](https://img.shields.io/badge/-View%20on%20GitHub-181717?logo=github&logoColor=ffffff)](https://github.com/dmlc/dgl/blob/master/notebooks/stochastic_training/ondisk_dataset_heterograph.ipynb)\n",
        "\n",
        "This tutorial shows how to create an `OnDiskDataset` for a heterogeneous graph that can be used in the **GraphBolt** framework. The major difference from creating a dataset for a homogeneous graph is that we need to specify node/edge types for edges, feature data and training/validation/test sets.\n",
        "\n",
        "By the end of this tutorial, you will be able to\n",
        "\n",
        "- organize graph structure data.\n",
        "- organize feature data.\n",
        "- organize training/validation/test sets for specific tasks.\n",
        "\n",
        "To create an ``OnDiskDataset`` object, you need to organize all the data, including graph structure, feature data and tasks, into a directory. The directory should contain a ``metadata.yaml`` file that describes the metadata of the dataset.\n",
        "\n",
        "Now let's generate the various data step by step and finally organize them together to instantiate `OnDiskDataset`."
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "Wlb19DtWgtzq"
      },
      "source": [
        "## Install DGL package"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "id": "UojlT9ZGgyr9"
      },
      "outputs": [],
      "source": [
        "# Install required packages.\n",
        "import os\n",
        "import torch\n",
        "import numpy as np\n",
        "os.environ['TORCH'] = torch.__version__\n",
        "os.environ['DGLBACKEND'] = \"pytorch\"\n",
        "\n",
        "# Install the CPU version.\n",
        "device = torch.device(\"cpu\")\n",
        "!pip install --pre dgl -f https://data.dgl.ai/wheels-test/repo.html\n",
        "\n",
        "try:\n",
        "    import dgl\n",
        "    import dgl.graphbolt as gb\n",
        "    installed = True\n",
        "except ImportError as error:\n",
        "    installed = False\n",
        "    print(error)\n",
        "print(\"DGL installed!\" if installed else \"DGL not found!\")"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "2R7WnSbjsfbr"
      },
      "source": [
        "## Data preparation\n",
        "In order to demonstrate how to organize various data, let's create a base directory first."
72
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "id": "SZipbzyltLfO"
      },
      "outputs": [],
      "source": [
        "base_dir = './ondisk_dataset_heterograph'\n",
        "os.makedirs(base_dir, exist_ok=True)\n",
        "print(f\"Created base directory: {base_dir}\")"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "qhNtIn_xhlnl"
      },
      "source": [
        "### Generate graph structure data\n",
        "For a heterogeneous graph, we need to save the edges (namely seeds) of each edge type into separate **Numpy** or **CSV** files.\n",
        "\n",
        "Note:\n",
        "- when saving to **Numpy**, the array is required to be of shape `(2, N)`. This format is recommended, as constructing the graph from it is much faster than from a **CSV** file.\n",
        "- when saving to a **CSV** file, do not save the index or header.\n"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "id": "HcBt4G5BmSjr"
      },
      "outputs": [],
      "source": [
        "import numpy as np\n",
        "import pandas as pd\n",
        "\n",
        "# For simplicity, we create a heterogeneous graph with\n",
        "# 2 node types: `user`, `item`\n",
        "# 2 edge types: `user:like:item`, `user:follow:user`\n",
        "# And each node/edge type has the same number of nodes/edges.\n",
        "num_nodes = 1000\n",
        "num_edges = 10 * num_nodes\n",
        "\n",
        "# Edge type: \"user:like:item\"\n",
        "like_edges_path = os.path.join(base_dir, \"like-edges.csv\")\n",
        "like_edges = np.random.randint(0, num_nodes, size=(num_edges, 2))\n",
        "print(f\"Part of [user:like:item] edges: {like_edges[:5, :]}\\n\")\n",
        "\n",
        "df = pd.DataFrame(like_edges)\n",
        "df.to_csv(like_edges_path, index=False, header=False)\n",
        "print(f\"[user:like:item] edges are saved into {like_edges_path}\\n\")\n",
        "\n",
        "# Edge type: \"user:follow:user\"\n",
        "follow_edges_path = os.path.join(base_dir, \"follow-edges.csv\")\n",
        "follow_edges = np.random.randint(0, num_nodes, size=(num_edges, 2))\n",
        "print(f\"Part of [user:follow:user] edges: {follow_edges[:5, :]}\\n\")\n",
        "\n",
        "df = pd.DataFrame(follow_edges)\n",
        "df.to_csv(follow_edges_path, index=False, header=False)\n",
        "print(f\"[user:follow:user] edges are saved into {follow_edges_path}\\n\")"
      ]
    },
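    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "As noted above, the recommended alternative is to save edges as **Numpy** arrays of shape `(2, N)`, with one row of source node IDs and one of destination node IDs. The sketch below shows how the `user:like:item` edges would be saved in that format; it is not referenced by the `metadata.yaml` generated later in this tutorial, where the corresponding `format` field would be `numpy` instead of `csv`."
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {},
      "outputs": [],
      "source": [
        "# A minimal sketch: save the same edges in the recommended Numpy layout.\n",
        "# Note the transpose: `like_edges` has shape (num_edges, 2), while the\n",
        "# Numpy on-disk format expects shape (2, N).\n",
        "like_edges_npy_path = os.path.join(base_dir, \"like-edges.npy\")\n",
        "np.save(like_edges_npy_path, like_edges.T)\n",
        "print(f\"[user:like:item] edges of shape {like_edges.T.shape} are saved into {like_edges_npy_path}\")"
      ]
    },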
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "kh-4cPtzpcaH"
      },
      "source": [
        "### Generate feature data for graph\n",
        "For feature data, numpy arrays and torch tensors are supported for now. Let's generate feature data for each node/edge type."
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "id": "_PVu1u5brBhF"
      },
      "outputs": [],
      "source": [
        "# Generate node[user] feature in numpy array.\n",
        "node_user_feat_0_path = os.path.join(base_dir, \"node-user-feat-0.npy\")\n",
        "node_user_feat_0 = np.random.rand(num_nodes, 5)\n",
        "print(f\"Part of node[user] feature [feat_0]: {node_user_feat_0[:3, :]}\")\n",
        "np.save(node_user_feat_0_path, node_user_feat_0)\n",
        "print(f\"Node[user] feature [feat_0] is saved to {node_user_feat_0_path}\\n\")\n",
        "\n",
        "# Generate another node[user] feature in torch tensor\n",
        "node_user_feat_1_path = os.path.join(base_dir, \"node-user-feat-1.pt\")\n",
        "node_user_feat_1 = torch.rand(num_nodes, 5)\n",
        "print(f\"Part of node[user] feature [feat_1]: {node_user_feat_1[:3, :]}\")\n",
        "torch.save(node_user_feat_1, node_user_feat_1_path)\n",
        "print(f\"Node[user] feature [feat_1] is saved to {node_user_feat_1_path}\\n\")\n",
        "\n",
        "# Generate node[item] feature in numpy array.\n",
        "node_item_feat_0_path = os.path.join(base_dir, \"node-item-feat-0.npy\")\n",
        "node_item_feat_0 = np.random.rand(num_nodes, 5)\n",
        "print(f\"Part of node[item] feature [feat_0]: {node_item_feat_0[:3, :]}\")\n",
        "np.save(node_item_feat_0_path, node_item_feat_0)\n",
        "print(f\"Node[item] feature [feat_0] is saved to {node_item_feat_0_path}\\n\")\n",
        "\n",
        "# Generate another node[item] feature in torch tensor\n",
        "node_item_feat_1_path = os.path.join(base_dir, \"node-item-feat-1.pt\")\n",
        "node_item_feat_1 = torch.rand(num_nodes, 5)\n",
        "print(f\"Part of node[item] feature [feat_1]: {node_item_feat_1[:3, :]}\")\n",
        "torch.save(node_item_feat_1, node_item_feat_1_path)\n",
        "print(f\"Node[item] feature [feat_1] is saved to {node_item_feat_1_path}\\n\")\n",
        "\n",
        "# Generate edge[user:like:item] feature in numpy array.\n",
        "edge_like_feat_0_path = os.path.join(base_dir, \"edge-like-feat-0.npy\")\n",
        "edge_like_feat_0 = np.random.rand(num_edges, 5)\n",
        "print(f\"Part of edge[user:like:item] feature [feat_0]: {edge_like_feat_0[:3, :]}\")\n",
        "np.save(edge_like_feat_0_path, edge_like_feat_0)\n",
        "print(f\"Edge[user:like:item] feature [feat_0] is saved to {edge_like_feat_0_path}\\n\")\n",
        "\n",
        "# Generate another edge[user:like:item] feature in torch tensor\n",
        "edge_like_feat_1_path = os.path.join(base_dir, \"edge-like-feat-1.pt\")\n",
        "edge_like_feat_1 = torch.rand(num_edges, 5)\n",
        "print(f\"Part of edge[user:like:item] feature [feat_1]: {edge_like_feat_1[:3, :]}\")\n",
        "torch.save(edge_like_feat_1, edge_like_feat_1_path)\n",
        "print(f\"Edge[user:like:item] feature [feat_1] is saved to {edge_like_feat_1_path}\\n\")\n",
        "\n",
        "# Generate edge[user:follow:user] feature in numpy array.\n",
        "edge_follow_feat_0_path = os.path.join(base_dir, \"edge-follow-feat-0.npy\")\n",
        "edge_follow_feat_0 = np.random.rand(num_edges, 5)\n",
        "print(f\"Part of edge[user:follow:user] feature [feat_0]: {edge_follow_feat_0[:3, :]}\")\n",
        "np.save(edge_follow_feat_0_path, edge_follow_feat_0)\n",
        "print(f\"Edge[user:follow:user] feature [feat_0] is saved to {edge_follow_feat_0_path}\\n\")\n",
        "\n",
        "# Generate another edge[user:follow:user] feature in torch tensor\n",
        "edge_follow_feat_1_path = os.path.join(base_dir, \"edge-follow-feat-1.pt\")\n",
        "edge_follow_feat_1 = torch.rand(num_edges, 5)\n",
        "print(f\"Part of edge[user:follow:user] feature [feat_1]: {edge_follow_feat_1[:3, :]}\")\n",
        "torch.save(edge_follow_feat_1, edge_follow_feat_1_path)\n",
        "print(f\"Edge[user:follow:user] feature [feat_1] is saved to {edge_follow_feat_1_path}\\n\")"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "ZyqgOtsIwzh_"
      },
      "source": [
        "### Generate tasks\n",
        "`OnDiskDataset` supports multiple tasks. For each task, we need to prepare training/validation/test sets respectively. Such sets usually vary among different tasks. In this tutorial, let's create a **Node Classification** task and **Link Prediction** task."
221
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "hVxHaDIfzCkr"
      },
      "source": [
        "#### Node Classification Task\n",
        "For node classification task, we need **node IDs** and corresponding **labels** for each training/validation/test set. Like feature data, numpy arrays and torch tensors are supported for these sets."
231
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "id": "S5-fyBbHzTCO"
      },
      "outputs": [],
      "source": [
        "# For illustration, let's generate item sets for each node type.\n",
        "num_trains = int(num_nodes * 0.6)\n",
        "num_vals = int(num_nodes * 0.2)\n",
        "num_tests = num_nodes - num_trains - num_vals\n",
        "\n",
        "user_ids = np.arange(num_nodes)\n",
        "np.random.shuffle(user_ids)\n",
        "\n",
        "item_ids = np.arange(num_nodes)\n",
        "np.random.shuffle(item_ids)\n",
        "\n",
        "# Train IDs for user.\n",
        "nc_train_user_ids_path = os.path.join(base_dir, \"nc-train-user-ids.npy\")\n",
        "nc_train_user_ids = user_ids[:num_trains]\n",
        "print(f\"Part of train ids[user] for node classification: {nc_train_user_ids[:3]}\")\n",
        "np.save(nc_train_user_ids_path, nc_train_user_ids)\n",
        "print(f\"NC train ids[user] are saved to {nc_train_user_ids_path}\\n\")\n",
        "\n",
        "# Train labels for user.\n",
        "nc_train_user_labels_path = os.path.join(base_dir, \"nc-train-user-labels.pt\")\n",
        "nc_train_user_labels = torch.randint(0, 10, (num_trains,))\n",
        "print(f\"Part of train labels[user] for node classification: {nc_train_user_labels[:3]}\")\n",
        "torch.save(nc_train_user_labels, nc_train_user_labels_path)\n",
        "print(f\"NC train labels[user] are saved to {nc_train_user_labels_path}\\n\")\n",
        "\n",
        "# Train IDs for item.\n",
        "nc_train_item_ids_path = os.path.join(base_dir, \"nc-train-item-ids.npy\")\n",
        "nc_train_item_ids = item_ids[:num_trains]\n",
        "print(f\"Part of train ids[item] for node classification: {nc_train_item_ids[:3]}\")\n",
        "np.save(nc_train_item_ids_path, nc_train_item_ids)\n",
        "print(f\"NC train ids[item] are saved to {nc_train_item_ids_path}\\n\")\n",
        "\n",
        "# Train labels for item.\n",
        "nc_train_item_labels_path = os.path.join(base_dir, \"nc-train-item-labels.pt\")\n",
        "nc_train_item_labels = torch.randint(0, 10, (num_trains,))\n",
        "print(f\"Part of train labels[item] for node classification: {nc_train_item_labels[:3]}\")\n",
        "torch.save(nc_train_item_labels, nc_train_item_labels_path)\n",
        "print(f\"NC train labels[item] are saved to {nc_train_item_labels_path}\\n\")\n",
        "\n",
        "# Val IDs for user.\n",
        "nc_val_user_ids_path = os.path.join(base_dir, \"nc-val-user-ids.npy\")\n",
        "nc_val_user_ids = user_ids[num_trains:num_trains+num_vals]\n",
        "print(f\"Part of val ids[user] for node classification: {nc_val_user_ids[:3]}\")\n",
        "np.save(nc_val_user_ids_path, nc_val_user_ids)\n",
        "print(f\"NC val ids[user] are saved to {nc_val_user_ids_path}\\n\")\n",
        "\n",
        "# Val labels for user.\n",
        "nc_val_user_labels_path = os.path.join(base_dir, \"nc-val-user-labels.pt\")\n",
        "nc_val_user_labels = torch.randint(0, 10, (num_vals,))\n",
        "print(f\"Part of val labels[user] for node classification: {nc_val_user_labels[:3]}\")\n",
        "torch.save(nc_val_user_labels, nc_val_user_labels_path)\n",
        "print(f\"NC val labels[user] are saved to {nc_val_user_labels_path}\\n\")\n",
        "\n",
        "# Val IDs for item.\n",
        "nc_val_item_ids_path = os.path.join(base_dir, \"nc-val-item-ids.npy\")\n",
        "nc_val_item_ids = item_ids[num_trains:num_trains+num_vals]\n",
        "print(f\"Part of val ids[item] for node classification: {nc_val_item_ids[:3]}\")\n",
        "np.save(nc_val_item_ids_path, nc_val_item_ids)\n",
        "print(f\"NC val ids[item] are saved to {nc_val_item_ids_path}\\n\")\n",
        "\n",
        "# Val labels for item.\n",
        "nc_val_item_labels_path = os.path.join(base_dir, \"nc-val-item-labels.pt\")\n",
        "nc_val_item_labels = torch.randint(0, 10, (num_vals,))\n",
        "print(f\"Part of val labels[item] for node classification: {nc_val_item_labels[:3]}\")\n",
        "torch.save(nc_val_item_labels, nc_val_item_labels_path)\n",
        "print(f\"NC val labels[item] are saved to {nc_val_item_labels_path}\\n\")\n",
        "\n",
        "# Test IDs for user.\n",
        "nc_test_user_ids_path = os.path.join(base_dir, \"nc-test-user-ids.npy\")\n",
        "nc_test_user_ids = user_ids[-num_tests:]\n",
        "print(f\"Part of test ids[user] for node classification: {nc_test_user_ids[:3]}\")\n",
        "np.save(nc_test_user_ids_path, nc_test_user_ids)\n",
        "print(f\"NC test ids[user] are saved to {nc_test_user_ids_path}\\n\")\n",
        "\n",
        "# Test labels for user.\n",
        "nc_test_user_labels_path = os.path.join(base_dir, \"nc-test-user-labels.pt\")\n",
        "nc_test_user_labels = torch.randint(0, 10, (num_tests,))\n",
        "print(f\"Part of test labels[user] for node classification: {nc_test_user_labels[:3]}\")\n",
        "torch.save(nc_test_user_labels, nc_test_user_labels_path)\n",
        "print(f\"NC test labels[user] are saved to {nc_test_user_labels_path}\\n\")\n",
        "\n",
        "# Test IDs for item.\n",
        "nc_test_item_ids_path = os.path.join(base_dir, \"nc-test-item-ids.npy\")\n",
        "nc_test_item_ids = item_ids[-num_tests:]\n",
        "print(f\"Part of test ids[item] for node classification: {nc_test_item_ids[:3]}\")\n",
        "np.save(nc_test_item_ids_path, nc_test_item_ids)\n",
        "print(f\"NC test ids[item] are saved to {nc_test_item_ids_path}\\n\")\n",
        "\n",
        "# Test labels for item.\n",
        "nc_test_item_labels_path = os.path.join(base_dir, \"nc-test-item-labels.pt\")\n",
        "nc_test_item_labels = torch.randint(0, 10, (num_tests,))\n",
        "print(f\"Part of test labels[item] for node classification: {nc_test_item_labels[:3]}\")\n",
        "torch.save(nc_test_item_labels, nc_test_item_labels_path)\n",
        "print(f\"NC test labels[item] are saved to {nc_test_item_labels_path}\\n\")"
      ]
    },
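    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "As a quick optional sanity check on the split logic above, we can verify that the user train/validation/test splits are disjoint and together cover all user nodes:"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {},
      "outputs": [],
      "source": [
        "# Optional sanity check: the user train/val/test splits should be disjoint\n",
        "# and together cover every user node ID exactly once.\n",
        "all_user_ids = np.concatenate(\n",
        "    [nc_train_user_ids, nc_val_user_ids, nc_test_user_ids]\n",
        ")\n",
        "assert len(all_user_ids) == num_nodes\n",
        "assert len(np.unique(all_user_ids)) == num_nodes\n",
        "print(\"User train/val/test splits are disjoint and complete.\")"
      ]
    },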
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "LhAcDCHQ_KJ0"
      },
      "source": [
        "#### Link Prediction Task\n",
        "For link prediction task, we need **seeds** or **corresponding labels and indexes** which representing the pos/neg property and group of the seeds for each training/validation/test set. Like feature data, numpy arrays and torch tensors are supported for these sets."
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "id": "u0jCnXIcAQy4"
      },
      "outputs": [],
      "source": [
        "# For illustration, let's generate item sets for each edge type.\n",
        "num_trains = int(num_edges * 0.6)\n",
        "num_vals = int(num_edges * 0.2)\n",
        "num_tests = num_edges - num_trains - num_vals\n",
        "\n",
        "# Train seeds for user:like:item.\n",
        "lp_train_like_seeds_path = os.path.join(base_dir, \"lp-train-like-seeds.npy\")\n",
        "lp_train_like_seeds = like_edges[:num_trains, :]\n",
        "print(f\"Part of train seeds[user:like:item] for link prediction: {lp_train_like_seeds[:3]}\")\n",
        "np.save(lp_train_like_seeds_path, lp_train_like_seeds)\n",
        "print(f\"LP train seeds[user:like:item] are saved to {lp_train_like_seeds_path}\\n\")\n",
        "\n",
        "# Train seeds for user:follow:user.\n",
        "lp_train_follow_seeds_path = os.path.join(base_dir, \"lp-train-follow-seeds.npy\")\n",
        "lp_train_follow_seeds = follow_edges[:num_trains, :]\n",
        "print(f\"Part of train seeds[user:follow:user] for link prediction: {lp_train_follow_seeds[:3]}\")\n",
        "np.save(lp_train_follow_seeds_path, lp_train_follow_seeds)\n",
        "print(f\"LP train seeds[user:follow:user] are saved to {lp_train_follow_seeds_path}\\n\")\n",
        "\n",
        "# Val seeds for user:like:item.\n",
        "lp_val_like_seeds_path = os.path.join(base_dir, \"lp-val-like-seeds.npy\")\n",
        "lp_val_like_seeds = like_edges[num_trains:num_trains+num_vals, :]\n",
        "lp_val_like_neg_dsts = np.random.randint(0, num_nodes, (num_vals, 10)).reshape(-1)\n",
        "lp_val_like_neg_srcs = np.repeat(lp_val_like_seeds[:,0], 10)\n",
        "lp_val_like_neg_seeds = np.concatenate((lp_val_like_neg_srcs, lp_val_like_neg_dsts)).reshape(2,-1).T\n",
        "lp_val_like_seeds = np.concatenate((lp_val_like_seeds, lp_val_like_neg_seeds))\n",
        "print(f\"Part of val seeds[user:like:item] for link prediction: {lp_val_like_seeds[:3]}\")\n",
        "np.save(lp_val_like_seeds_path, lp_val_like_seeds)\n",
        "print(f\"LP val seeds[user:like:item] are saved to {lp_val_like_seeds_path}\\n\")\n",
        "\n",
        "# Val labels for user:like:item.\n",
        "lp_val_like_labels_path = os.path.join(base_dir, \"lp-val-like-labels.npy\")\n",
        "lp_val_like_labels = np.empty(num_vals * (10 + 1))\n",
        "lp_val_like_labels[:num_vals] = 1\n",
        "lp_val_like_labels[num_vals:] = 0\n",
        "print(f\"Part of val labels[user:like:item] for link prediction: {lp_val_like_labels[:3]}\")\n",
        "np.save(lp_val_like_labels_path, lp_val_like_labels)\n",
        "print(f\"LP val labels[user:like:item] are saved to {lp_val_like_labels_path}\\n\")\n",
        "\n",
        "# Val indexes for user:like:item.\n",
        "lp_val_like_indexes_path = os.path.join(base_dir, \"lp-val-like-indexes.npy\")\n",
        "lp_val_like_indexes = np.arange(0, num_vals)\n",
        "lp_val_like_neg_indexes = np.repeat(lp_val_like_indexes, 10)\n",
        "lp_val_like_indexes = np.concatenate([lp_val_like_indexes, lp_val_like_neg_indexes])\n",
        "print(f\"Part of val indexes[user:like:item] for link prediction: {lp_val_like_indexes[:3]}\")\n",
        "np.save(lp_val_like_indexes_path, lp_val_like_indexes)\n",
        "print(f\"LP val indexes[user:like:item] are saved to {lp_val_like_indexes_path}\\n\")\n",
        "\n",
        "# Val seeds for user:follow:item.\n",
        "lp_val_follow_seeds_path = os.path.join(base_dir, \"lp-val-follow-seeds.npy\")\n",
        "lp_val_follow_seeds = follow_edges[num_trains:num_trains+num_vals, :]\n",
        "lp_val_follow_neg_dsts = np.random.randint(0, num_nodes, (num_vals, 10)).reshape(-1)\n",
        "lp_val_follow_neg_srcs = np.repeat(lp_val_follow_seeds[:,0], 10)\n",
        "lp_val_follow_neg_seeds = np.concatenate((lp_val_follow_neg_srcs, lp_val_follow_neg_dsts)).reshape(2,-1).T\n",
        "lp_val_follow_seeds = np.concatenate((lp_val_follow_seeds, lp_val_follow_neg_seeds))\n",
        "print(f\"Part of val seeds[user:follow:item] for link prediction: {lp_val_follow_seeds[:3]}\")\n",
        "np.save(lp_val_follow_seeds_path, lp_val_follow_seeds)\n",
        "print(f\"LP val seeds[user:follow:item] are saved to {lp_val_follow_seeds_path}\\n\")\n",
        "\n",
        "# Val labels for user:follow:item.\n",
        "lp_val_follow_labels_path = os.path.join(base_dir, \"lp-val-follow-labels.npy\")\n",
        "lp_val_follow_labels = np.empty(num_vals * (10 + 1))\n",
        "lp_val_follow_labels[:num_vals] = 1\n",
        "lp_val_follow_labels[num_vals:] = 0\n",
        "print(f\"Part of val labels[user:follow:item] for link prediction: {lp_val_follow_labels[:3]}\")\n",
        "np.save(lp_val_follow_labels_path, lp_val_follow_labels)\n",
        "print(f\"LP val labels[user:follow:item] are saved to {lp_val_follow_labels_path}\\n\")\n",
        "\n",
        "# Val indexes for user:follow:item.\n",
        "lp_val_follow_indexes_path = os.path.join(base_dir, \"lp-val-follow-indexes.npy\")\n",
        "lp_val_follow_indexes = np.arange(0, num_vals)\n",
        "lp_val_follow_neg_indexes = np.repeat(lp_val_follow_indexes, 10)\n",
        "lp_val_follow_indexes = np.concatenate([lp_val_follow_indexes, lp_val_follow_neg_indexes])\n",
        "print(f\"Part of val indexes[user:follow:item] for link prediction: {lp_val_follow_indexes[:3]}\")\n",
        "np.save(lp_val_follow_indexes_path, lp_val_follow_indexes)\n",
        "print(f\"LP val indexes[user:follow:item] are saved to {lp_val_follow_indexes_path}\\n\")\n",
        "\n",
        "# Test seeds for user:like:item.\n",
        "lp_test_like_seeds_path = os.path.join(base_dir, \"lp-test-like-seeds.npy\")\n",
        "lp_test_like_seeds = like_edges[-num_tests:, :]\n",
        "lp_test_like_neg_dsts = np.random.randint(0, num_nodes, (num_tests, 10)).reshape(-1)\n",
        "lp_test_like_neg_srcs = np.repeat(lp_test_like_seeds[:,0], 10)\n",
        "lp_test_like_neg_seeds = np.concatenate((lp_test_like_neg_srcs, lp_test_like_neg_dsts)).reshape(2,-1).T\n",
        "lp_test_like_seeds = np.concatenate((lp_test_like_seeds, lp_test_like_neg_seeds))\n",
        "print(f\"Part of test seeds[user:like:item] for link prediction: {lp_test_like_seeds[:3]}\")\n",
        "np.save(lp_test_like_seeds_path, lp_test_like_seeds)\n",
        "print(f\"LP test seeds[user:like:item] are saved to {lp_test_like_seeds_path}\\n\")\n",
        "\n",
        "# Test labels for user:like:item.\n",
        "lp_test_like_labels_path = os.path.join(base_dir, \"lp-test-like-labels.npy\")\n",
        "lp_test_like_labels = np.empty(num_tests * (10 + 1))\n",
        "lp_test_like_labels[:num_tests] = 1\n",
        "lp_test_like_labels[num_tests:] = 0\n",
        "print(f\"Part of test labels[user:like:item] for link prediction: {lp_test_like_labels[:3]}\")\n",
        "np.save(lp_test_like_labels_path, lp_test_like_labels)\n",
        "print(f\"LP test labels[user:like:item] are saved to {lp_test_like_labels_path}\\n\")\n",
        "\n",
        "# Test indexes for user:like:item.\n",
        "lp_test_like_indexes_path = os.path.join(base_dir, \"lp-test-like-indexes.npy\")\n",
        "lp_test_like_indexes = np.arange(0, num_tests)\n",
        "lp_test_like_neg_indexes = np.repeat(lp_test_like_indexes, 10)\n",
        "lp_test_like_indexes = np.concatenate([lp_test_like_indexes, lp_test_like_neg_indexes])\n",
        "print(f\"Part of test indexes[user:like:item] for link prediction: {lp_test_like_indexes[:3]}\")\n",
        "np.save(lp_test_like_indexes_path, lp_test_like_indexes)\n",
        "print(f\"LP test indexes[user:like:item] are saved to {lp_test_like_indexes_path}\\n\")\n",
        "\n",
        "# Test seeds for user:follow:item.\n",
        "lp_test_follow_seeds_path = os.path.join(base_dir, \"lp-test-follow-seeds.npy\")\n",
        "lp_test_follow_seeds = follow_edges[-num_tests:, :]\n",
        "lp_test_follow_neg_dsts = np.random.randint(0, num_nodes, (num_tests, 10)).reshape(-1)\n",
        "lp_test_follow_neg_srcs = np.repeat(lp_test_follow_seeds[:,0], 10)\n",
        "lp_test_follow_neg_seeds = np.concatenate((lp_test_follow_neg_srcs, lp_test_follow_neg_dsts)).reshape(2,-1).T\n",
        "lp_test_follow_seeds = np.concatenate((lp_test_follow_seeds, lp_test_follow_neg_seeds))\n",
        "print(f\"Part of test seeds[user:follow:item] for link prediction: {lp_test_follow_seeds[:3]}\")\n",
        "np.save(lp_test_follow_seeds_path, lp_test_follow_seeds)\n",
        "print(f\"LP test seeds[user:follow:item] are saved to {lp_test_follow_seeds_path}\\n\")\n",
        "\n",
        "# Test labels for user:follow:item.\n",
        "lp_test_follow_labels_path = os.path.join(base_dir, \"lp-test-follow-labels.npy\")\n",
        "lp_test_follow_labels = np.empty(num_tests * (10 + 1))\n",
        "lp_test_follow_labels[:num_tests] = 1\n",
        "lp_test_follow_labels[num_tests:] = 0\n",
        "print(f\"Part of test labels[user:follow:item] for link prediction: {lp_test_follow_labels[:3]}\")\n",
        "np.save(lp_test_follow_labels_path, lp_test_follow_labels)\n",
        "print(f\"LP test labels[user:follow:item] are saved to {lp_test_follow_labels_path}\\n\")\n",
        "\n",
        "# Test indexes for user:follow:item.\n",
        "lp_test_follow_indexes_path = os.path.join(base_dir, \"lp-test-follow-indexes.npy\")\n",
        "lp_test_follow_indexes = np.arange(0, num_tests)\n",
        "lp_test_follow_neg_indexes = np.repeat(lp_test_follow_indexes, 10)\n",
        "lp_test_follow_indexes = np.concatenate([lp_test_follow_indexes, lp_test_follow_neg_indexes])\n",
        "print(f\"Part of test indexes[user:follow:item] for link prediction: {lp_test_follow_indexes[:3]}\")\n",
        "np.save(lp_test_follow_indexes_path, lp_test_follow_indexes)\n",
        "print(f\"LP test indexes[user:follow:item] are saved to {lp_test_follow_indexes_path}\\n\")"
      ]
    },
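    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "To make the seeds/labels/indexes layout concrete: positives come first, followed by 10 negatives per positive; `labels` mark positives with 1 and negatives with 0; and `indexes` map every seed back to the group of its positive seed. The following quick check (optional; it only inspects the arrays generated above) illustrates this on the validation set of `user:like:item`:"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {},
      "outputs": [],
      "source": [
        "# Each positive seed has 10 negatives further down the arrays, so every\n",
        "# array holds num_vals * (10 + 1) entries.\n",
        "assert lp_val_like_seeds.shape == (num_vals * (10 + 1), 2)\n",
        "assert lp_val_like_labels.shape == (num_vals * (10 + 1),)\n",
        "assert lp_val_like_indexes.shape == (num_vals * (10 + 1),)\n",
        "\n",
        "# Group 0 consists of 1 positive seed and its 10 negatives.\n",
        "group_0 = lp_val_like_indexes == 0\n",
        "print(f\"Seeds of group 0: {lp_val_like_seeds[group_0]}\")\n",
        "print(f\"Labels of group 0: {lp_val_like_labels[group_0]}\")"
      ]
    },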
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "wbk6-wxRK-6S"
      },
      "source": [
        "## Organize Data into YAML File\n",
        "Now we need to create a `metadata.yaml` file which contains the paths and data types of the graph structure, feature data and training/validation/test sets. Please note that all paths should be relative to `metadata.yaml`.\n",
        "\n",
        "For heterogeneous graph, we need to specify the node/edge type in **type** fields. For edge type, canonical etype is required which is a string that's concatenated by source node type, etype, and destination node type together with `:`.\n",
        "\n",
        "Notes:\n",
        "- all path should be relative to `metadata.yaml`.\n",
        "- Below fields are optional and not specified in below example.\n",
        "  - `in_memory`: indicates whether to load dada into memory or `mmap`. Default is `True`.\n",
        "\n",
        "Please refer to [YAML specification](https://github.com/dmlc/dgl/blob/master/docs/source/stochastic_training/ondisk-dataset-specification.rst) for more details."
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "id": "ddGTWW61Lpwp"
      },
      "outputs": [],
      "source": [
        "yaml_content = f\"\"\"\n",
        "    dataset_name: heterogeneous_graph_nc_lp\n",
        "    graph:\n",
        "      nodes:\n",
        "        - type: user\n",
        "          num: {num_nodes}\n",
        "        - type: item\n",
        "          num: {num_nodes}\n",
        "      edges:\n",
        "        - type: \"user:like:item\"\n",
        "          format: csv\n",
        "          path: {os.path.basename(like_edges_path)}\n",
        "        - type: \"user:follow:user\"\n",
        "          format: csv\n",
        "          path: {os.path.basename(follow_edges_path)}\n",
        "    feature_data:\n",
        "      - domain: node\n",
        "        type: user\n",
        "        name: feat_0\n",
        "        format: numpy\n",
        "        path: {os.path.basename(node_user_feat_0_path)}\n",
        "      - domain: node\n",
        "        type: user\n",
        "        name: feat_1\n",
        "        format: torch\n",
        "        path: {os.path.basename(node_user_feat_1_path)}\n",
        "      - domain: node\n",
        "        type: item\n",
        "        name: feat_0\n",
        "        format: numpy\n",
        "        path: {os.path.basename(node_item_feat_0_path)}\n",
        "      - domain: node\n",
        "        type: item\n",
        "        name: feat_1\n",
        "        format: torch\n",
        "        path: {os.path.basename(node_item_feat_1_path)}\n",
        "      - domain: edge\n",
        "        type: \"user:like:item\"\n",
        "        name: feat_0\n",
        "        format: numpy\n",
        "        path: {os.path.basename(edge_like_feat_0_path)}\n",
        "      - domain: edge\n",
        "        type: \"user:like:item\"\n",
        "        name: feat_1\n",
        "        format: torch\n",
        "        path: {os.path.basename(edge_like_feat_1_path)}\n",
        "      - domain: edge\n",
        "        type: \"user:follow:user\"\n",
        "        name: feat_0\n",
        "        format: numpy\n",
        "        path: {os.path.basename(edge_follow_feat_0_path)}\n",
        "      - domain: edge\n",
        "        type: \"user:follow:user\"\n",
        "        name: feat_1\n",
        "        format: torch\n",
        "        path: {os.path.basename(edge_follow_feat_1_path)}\n",
        "    tasks:\n",
        "      - name: node_classification\n",
        "        num_classes: 10\n",
        "        train_set:\n",
        "          - type: user\n",
        "            data:\n",
        "              - name: seeds\n",
        "                format: numpy\n",
        "                path: {os.path.basename(nc_train_user_ids_path)}\n",
        "              - name: labels\n",
        "                format: torch\n",
        "                path: {os.path.basename(nc_train_user_labels_path)}\n",
        "          - type: item\n",
        "            data:\n",
        "              - name: seeds\n",
        "                format: numpy\n",
        "                path: {os.path.basename(nc_train_item_ids_path)}\n",
        "              - name: labels\n",
        "                format: torch\n",
        "                path: {os.path.basename(nc_train_item_labels_path)}\n",
        "        validation_set:\n",
        "          - type: user\n",
        "            data:\n",
        "              - name: seeds\n",
        "                format: numpy\n",
        "                path: {os.path.basename(nc_val_user_ids_path)}\n",
        "              - name: labels\n",
        "                format: torch\n",
        "                path: {os.path.basename(nc_val_user_labels_path)}\n",
        "          - type: item\n",
        "            data:\n",
        "              - name: seeds\n",
        "                format: numpy\n",
        "                path: {os.path.basename(nc_val_item_ids_path)}\n",
        "              - name: labels\n",
        "                format: torch\n",
        "                path: {os.path.basename(nc_val_item_labels_path)}\n",
        "        test_set:\n",
        "          - type: user\n",
        "            data:\n",
        "              - name: seeds\n",
        "                format: numpy\n",
        "                path: {os.path.basename(nc_test_user_ids_path)}\n",
        "              - name: labels\n",
        "                format: torch\n",
        "                path: {os.path.basename(nc_test_user_labels_path)}\n",
        "          - type: item\n",
        "            data:\n",
        "              - name: seeds\n",
        "                format: numpy\n",
        "                path: {os.path.basename(nc_test_item_ids_path)}\n",
        "              - name: labels\n",
        "                format: torch\n",
        "                path: {os.path.basename(nc_test_item_labels_path)}\n",
        "      - name: link_prediction\n",
        "        num_classes: 10\n",
        "        train_set:\n",
        "          - type: \"user:like:item\"\n",
        "            data:\n",
        "              - name: seeds\n",
        "                format: numpy\n",
        "                path: {os.path.basename(lp_train_like_seeds_path)}\n",
        "          - type: \"user:follow:user\"\n",
        "            data:\n",
        "              - name: seeds\n",
        "                format: numpy\n",
        "                path: {os.path.basename(lp_train_follow_seeds_path)}\n",
        "        validation_set:\n",
        "          - type: \"user:like:item\"\n",
        "            data:\n",
        "              - name: seeds\n",
        "                format: numpy\n",
        "                path: {os.path.basename(lp_val_like_seeds_path)}\n",
        "              - name: labels\n",
        "                format: numpy\n",
        "                path: {os.path.basename(lp_val_like_labels_path)}\n",
        "              - name: indexes\n",
        "                format: numpy\n",
        "                path: {os.path.basename(lp_val_like_indexes_path)}\n",
        "          - type: \"user:follow:user\"\n",
        "            data:\n",
        "              - name: seeds\n",
        "                format: numpy\n",
        "                path: {os.path.basename(lp_val_follow_seeds_path)}\n",
        "              - name: labels\n",
        "                format: numpy\n",
        "                path: {os.path.basename(lp_val_follow_labels_path)}\n",
        "              - name: indexes\n",
        "                format: numpy\n",
        "                path: {os.path.basename(lp_val_follow_indexes_path)}\n",
        "        test_set:\n",
        "          - type: \"user:like:item\"\n",
        "            data:\n",
        "              - name: seeds\n",
        "                format: numpy\n",
        "                path: {os.path.basename(lp_test_like_seeds_path)}\n",
        "              - name: labels\n",
        "                format: numpy\n",
        "                path: {os.path.basename(lp_test_like_labels_path)}\n",
        "              - name: indexes\n",
        "                format: numpy\n",
        "                path: {os.path.basename(lp_test_like_indexes_path)}\n",
        "          - type: \"user:follow:user\"\n",
        "            data:\n",
        "              - name: seeds\n",
        "                format: numpy\n",
        "                path: {os.path.basename(lp_test_follow_seeds_path)}\n",
        "              - name: labels\n",
        "                format: numpy\n",
        "                path: {os.path.basename(lp_test_follow_labels_path)}\n",
        "              - name: indexes\n",
        "                format: numpy\n",
        "                path: {os.path.basename(lp_test_follow_indexes_path)}\n",
        "\"\"\"\n",
        "metadata_path = os.path.join(base_dir, \"metadata.yaml\")\n",
        "with open(metadata_path, \"w\") as f:\n",
        "  f.write(yaml_content)"
      ]
    },
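    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "As an optional sanity check (this assumes the `pyyaml` package is available, as it is by default in Colab), we can parse the written file back to verify it is well-formed YAML:"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {},
      "outputs": [],
      "source": [
        "# Optional: parse the written file back to verify it is well-formed YAML.\n",
        "import yaml\n",
        "\n",
        "with open(metadata_path) as f:\n",
        "    metadata = yaml.safe_load(f)\n",
        "print(f\"Dataset name: {metadata['dataset_name']}\")\n",
        "print(f\"Task names: {[task['name'] for task in metadata['tasks']]}\")"
      ]
    },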
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "kEfybHGhOW7O"
      },
      "source": [
        "## Instantiate `OnDiskDataset`\n",
        "Now we're ready to load dataset via `dgl.graphbolt.OnDiskDataset`. When instantiating, we just pass in the base directory where `metadata.yaml` file lies.\n",
        "\n",
        "During first instantiation, GraphBolt preprocesses the raw data such as constructing `FusedCSCSamplingGraph` from edges. All data including graph, feature data, training/validation/test sets are put into `preprocessed` directory after preprocessing. Any following dataset loading will skip the preprocess stage.\n",
        "\n",
        "After preprocessing, `load()` is required to be called explicitly in order to load graph, feature data and tasks."
705
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "id": "W58CZoSzOiyo"
      },
      "outputs": [],
      "source": [
        "dataset = gb.OnDiskDataset(base_dir).load()\n",
        "graph = dataset.graph\n",
        "print(f\"Loaded graph: {graph}\\n\")\n",
        "\n",
        "feature = dataset.feature\n",
        "print(f\"Loaded feature store: {feature}\\n\")\n",
        "\n",
        "tasks = dataset.tasks\n",
        "nc_task = tasks[0]\n",
        "print(f\"Loaded node classification task: {nc_task}\\n\")\n",
        "lp_task = tasks[1]\n",
        "print(f\"Loaded link prediction task: {lp_task}\\n\")"
      ]
    },
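    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "With the dataset loaded, the per-type item sets of each task can be fed into GraphBolt's stochastic training pipeline. As a minimal sketch (the batch size below is arbitrary), we can wrap the node classification training set in a `gb.ItemSampler` and inspect the first mini-batch:"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {},
      "outputs": [],
      "source": [
        "# A minimal sketch: draw one mini-batch from the node classification train set.\n",
        "# The train set is keyed by node type (`user` and `item`).\n",
        "item_sampler = gb.ItemSampler(nc_task.train_set, batch_size=16, shuffle=True)\n",
        "minibatch = next(iter(item_sampler))\n",
        "print(f\"First mini-batch: {minibatch}\")"
      ]
    }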
  ],
  "metadata": {
    "colab": {
      "private_outputs": true,
      "provenance": []
    },
    "kernelspec": {
      "display_name": "Python 3",
      "name": "python3"
    },
    "language_info": {
      "codemirror_mode": {
        "name": "ipython",
        "version": 3
      },
      "file_extension": ".py",
      "mimetype": "text/x-python",
      "name": "python",
      "nbconvert_exporter": "python",
      "pygments_lexer": "ipython3",
      "version": "3.10.12"
    }
  },
  "nbformat": 4,
  "nbformat_minor": 0
}