{
  "nbformat": 4,
  "nbformat_minor": 0,
  "metadata": {
    "colab": {
      "private_outputs": true,
      "provenance": []
    },
    "kernelspec": {
      "name": "python3",
      "display_name": "Python 3"
    },
    "language_info": {
      "name": "python"
    }
  },
  "cells": [
    {
      "cell_type": "markdown",
      "source": [
        "# OnDiskDataset for Heterogeneous Graph\n",
        "\n",
        "[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/dmlc/dgl/blob/master/notebooks/stochastic_training/ondisk_dataset_heterograph.ipynb) [![GitHub](https://img.shields.io/badge/-View%20on%20GitHub-181717?logo=github&logoColor=ffffff)](https://github.com/dmlc/dgl/blob/master/notebooks/stochastic_training/ondisk_dataset_heterograph.ipynb)\n",
        "\n",
        "This tutorial shows how to create `OnDiskDataset` for heterogeneous graph that could be used in **GraphBolt** framework. The major difference from creating dataset for homogeneous graph is that we need to specify node/edge types for edges, feature data, training/validation/test sets.\n",
        "\n",
        "By the end of this tutorial, you will be able to\n",
        "\n",
        "- organize graph structure data.\n",
        "- organize feature data.\n",
        "- organize training/validation/test set for specific tasks.\n",
        "\n",
        "To create an ``OnDiskDataset`` object, you need to organize all the data including graph structure, feature data and tasks into a directory. The directory should contain a ``metadata.yaml`` file that describes the metadata of the dataset.\n",
        "\n",
        "Now let's generate various data step by step and organize them together to instantiate `OnDiskDataset` finally."
      ],
      "metadata": {
        "id": "FnFhPMaAfLtJ"
      }
    },
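    {
      "cell_type": "markdown",
      "source": [
        "To make the target layout concrete before we start, the base directory built in this tutorial will end up looking roughly like the sketch below (only a few of the generated files are shown; the names are the ones chosen later in this tutorial):\n",
        "\n",
        "```text\n",
        "ondisk_dataset_heterograph/\n",
        "├── metadata.yaml          # describes everything below\n",
        "├── like-edges.csv         # edges of type user:like:item\n",
        "├── follow-edges.csv       # edges of type user:follow:user\n",
        "├── node-user-feat-0.npy   # feature data per node/edge type\n",
        "├── nc-train-user-ids.npy  # train/val/test sets per task\n",
        "└── ...\n",
        "```"
      ],
      "metadata": {}
    },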
    {
      "cell_type": "markdown",
      "source": [
        "## Install DGL package"
      ],
      "metadata": {
        "id": "Wlb19DtWgtzq"
      }
    },
    {
      "cell_type": "code",
      "source": [
        "# Install required packages.\n",
        "import os\n",
        "import torch\n",
        "import numpy as np\n",
        "os.environ['TORCH'] = torch.__version__\n",
        "os.environ['DGLBACKEND'] = \"pytorch\"\n",
        "\n",
        "# Install the CPU version.\n",
        "device = torch.device(\"cpu\")\n",
        "!pip install --pre dgl -f https://data.dgl.ai/wheels-test/repo.html\n",
        "\n",
        "try:\n",
        "    import dgl\n",
        "    import dgl.graphbolt as gb\n",
        "    installed = True\n",
        "except ImportError as error:\n",
        "    installed = False\n",
        "    print(error)\n",
        "print(\"DGL installed!\" if installed else \"DGL not found!\")"
      ],
      "metadata": {
        "id": "UojlT9ZGgyr9"
      },
      "execution_count": null,
      "outputs": []
    },
    {
      "cell_type": "markdown",
      "source": [
        "## Data preparation\n",
        "In order to demonstrate how to organize various data, let's create a base directory first."
      ],
      "metadata": {
        "id": "2R7WnSbjsfbr"
      }
    },
    {
      "cell_type": "code",
      "source": [
        "base_dir = './ondisk_dataset_heterograph'\n",
        "os.makedirs(base_dir, exist_ok=True)\n",
        "print(f\"Created base directory: {base_dir}\")"
      ],
      "metadata": {
        "id": "SZipbzyltLfO"
      },
      "execution_count": null,
      "outputs": []
    },
    {
      "cell_type": "markdown",
      "source": [
        "### Generate graph structure data\n",
        "For heterogeneous graph, we need to save different edge edges(namely node pairs) into separate **CSV** files.\n",
        "\n",
        "**Note**:\n",
109
        "when saving to file, do not save index and header.\n"
110
111
112
113
114
115
116
117
118
119
      ],
      "metadata": {
        "id": "qhNtIn_xhlnl"
      }
    },
    {
      "cell_type": "code",
      "source": [
        "import numpy as np\n",
        "import pandas as pd\n",
        "\n",
        "# For simplicity, we create a heterogeneous graph with\n",
        "# 2 node types: `user`, `item`\n",
        "# 2 edge types: `user:like:item`, `user:follow:user`\n",
        "# And each node/edge type has the same number of nodes/edges.\n",
        "num_nodes = 1000\n",
        "num_edges = 10 * num_nodes\n",
        "\n",
        "# Edge type: \"user:like:item\"\n",
        "like_edges_path = os.path.join(base_dir, \"like-edges.csv\")\n",
        "like_edges = np.random.randint(0, num_nodes, size=(num_edges, 2))\n",
        "print(f\"Part of [user:like:item] edges: {like_edges[:5, :]}\\n\")\n",
        "\n",
        "df = pd.DataFrame(like_edges)\n",
        "df.to_csv(like_edges_path, index=False, header=False)\n",
        "print(f\"[user:like:item] edges are saved into {like_edges_path}\\n\")\n",
        "\n",
        "# Edge type: \"user:follow:user\"\n",
        "follow_edges_path = os.path.join(base_dir, \"follow-edges.csv\")\n",
        "follow_edges = np.random.randint(0, num_nodes, size=(num_edges, 2))\n",
        "print(f\"Part of [user:follow:user] edges: {follow_edges[:5, :]}\\n\")\n",
        "\n",
        "df = pd.DataFrame(follow_edges)\n",
        "df.to_csv(follow_edges_path, index=False, header=False)\n",
        "print(f\"[user:follow:user] edges are saved into {follow_edges_path}\\n\")"
      ],
      "metadata": {
        "id": "HcBt4G5BmSjr"
      },
      "execution_count": null,
      "outputs": []
    },
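    {
      "cell_type": "markdown",
      "source": [
        "As an optional sanity check (not required by GraphBolt, just a quick way to catch mistakes early), we can read one of the CSV files back and verify that the header-less, index-less format round-trips to the original edge array."
      ],
      "metadata": {}
    },
    {
      "cell_type": "code",
      "source": [
        "# Optional sanity check: the saved CSV should reload to the same\n",
        "# (num_edges, 2) integer array, since no index or header was written.\n",
        "reloaded = pd.read_csv(like_edges_path, header=None).to_numpy()\n",
        "assert reloaded.shape == (num_edges, 2)\n",
        "assert (reloaded == like_edges).all()\n",
        "print(f\"{like_edges_path} round-trips correctly with shape {reloaded.shape}.\")"
      ],
      "metadata": {},
      "execution_count": null,
      "outputs": []
    },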
    {
      "cell_type": "markdown",
      "source": [
        "### Generate feature data for graph\n",
        "For feature data, numpy arrays and torch tensors are supported for now. Let's generate feature data for each node/edge type."
      ],
      "metadata": {
        "id": "kh-4cPtzpcaH"
      }
    },
    {
      "cell_type": "code",
      "source": [
        "# Generate node[user] feature in numpy array.\n",
        "node_user_feat_0_path = os.path.join(base_dir, \"node-user-feat-0.npy\")\n",
        "node_user_feat_0 = np.random.rand(num_nodes, 5)\n",
        "print(f\"Part of node[user] feature [feat_0]: {node_user_feat_0[:3, :]}\")\n",
        "np.save(node_user_feat_0_path, node_user_feat_0)\n",
        "print(f\"Node[user] feature [feat_0] is saved to {node_user_feat_0_path}\\n\")\n",
        "\n",
        "# Generate another node[user] feature in torch tensor\n",
        "node_user_feat_1_path = os.path.join(base_dir, \"node-user-feat-1.pt\")\n",
        "node_user_feat_1 = torch.rand(num_nodes, 5)\n",
        "print(f\"Part of node[user] feature [feat_1]: {node_user_feat_1[:3, :]}\")\n",
        "torch.save(node_user_feat_1, node_user_feat_1_path)\n",
        "print(f\"Node[user] feature [feat_1] is saved to {node_user_feat_1_path}\\n\")\n",
        "\n",
        "# Generate node[item] feature in numpy array.\n",
        "node_item_feat_0_path = os.path.join(base_dir, \"node-item-feat-0.npy\")\n",
        "node_item_feat_0 = np.random.rand(num_nodes, 5)\n",
        "print(f\"Part of node[item] feature [feat_0]: {node_item_feat_0[:3, :]}\")\n",
        "np.save(node_item_feat_0_path, node_item_feat_0)\n",
        "print(f\"Node[item] feature [feat_0] is saved to {node_item_feat_0_path}\\n\")\n",
        "\n",
        "# Generate another node[item] feature in torch tensor\n",
        "node_item_feat_1_path = os.path.join(base_dir, \"node-item-feat-1.pt\")\n",
        "node_item_feat_1 = torch.rand(num_nodes, 5)\n",
        "print(f\"Part of node[item] feature [feat_1]: {node_item_feat_1[:3, :]}\")\n",
        "torch.save(node_item_feat_1, node_item_feat_1_path)\n",
        "print(f\"Node[item] feature [feat_1] is saved to {node_item_feat_1_path}\\n\")\n",
        "\n",
        "# Generate edge[user:like:item] feature in numpy array.\n",
        "edge_like_feat_0_path = os.path.join(base_dir, \"edge-like-feat-0.npy\")\n",
        "edge_like_feat_0 = np.random.rand(num_edges, 5)\n",
        "print(f\"Part of edge[user:like:item] feature [feat_0]: {edge_like_feat_0[:3, :]}\")\n",
        "np.save(edge_like_feat_0_path, edge_like_feat_0)\n",
        "print(f\"Edge[user:like:item] feature [feat_0] is saved to {edge_like_feat_0_path}\\n\")\n",
        "\n",
        "# Generate another edge[user:like:item] feature in torch tensor\n",
        "edge_like_feat_1_path = os.path.join(base_dir, \"edge-like-feat-1.pt\")\n",
        "edge_like_feat_1 = torch.rand(num_edges, 5)\n",
        "print(f\"Part of edge[user:like:item] feature [feat_1]: {edge_like_feat_1[:3, :]}\")\n",
        "torch.save(edge_like_feat_1, edge_like_feat_1_path)\n",
        "print(f\"Edge[user:like:item] feature [feat_1] is saved to {edge_like_feat_1_path}\\n\")\n",
        "\n",
        "# Generate edge[user:follow:user] feature in numpy array.\n",
        "edge_follow_feat_0_path = os.path.join(base_dir, \"edge-follow-feat-0.npy\")\n",
        "edge_follow_feat_0 = np.random.rand(num_edges, 5)\n",
        "print(f\"Part of edge[user:follow:user] feature [feat_0]: {edge_follow_feat_0[:3, :]}\")\n",
        "np.save(edge_follow_feat_0_path, edge_follow_feat_0)\n",
        "print(f\"Edge[user:follow:user] feature [feat_0] is saved to {edge_follow_feat_0_path}\\n\")\n",
        "\n",
        "# Generate another edge[user:follow:user] feature in torch tensor\n",
        "edge_follow_feat_1_path = os.path.join(base_dir, \"edge-follow-feat-1.pt\")\n",
        "edge_follow_feat_1 = torch.rand(num_edges, 5)\n",
        "print(f\"Part of edge[user:follow:user] feature [feat_1]: {edge_follow_feat_1[:3, :]}\")\n",
        "torch.save(edge_follow_feat_1, edge_follow_feat_1_path)\n",
        "print(f\"Edge[user:follow:user] feature [feat_1] is saved to {edge_follow_feat_1_path}\\n\")"
      ],
      "metadata": {
        "id": "_PVu1u5brBhF"
      },
      "execution_count": null,
      "outputs": []
    },
    {
      "cell_type": "markdown",
      "source": [
        "### Generate tasks\n",
        "`OnDiskDataset` supports multiple tasks. For each task, we need to prepare training/validation/test sets respectively. Such sets usually vary among different tasks. In this tutorial, let's create a **Node Classification** task and **Link Prediction** task."
      ],
      "metadata": {
        "id": "ZyqgOtsIwzh_"
      }
    },
    {
      "cell_type": "markdown",
      "source": [
        "#### Node Classification Task\n",
        "For node classification task, we need **node IDs** and corresponding **labels** for each training/validation/test set. Like feature data, numpy arrays and torch tensors are supported for these sets."
      ],
      "metadata": {
        "id": "hVxHaDIfzCkr"
      }
    },
    {
      "cell_type": "code",
      "source": [
        "# For illustration, let's generate item sets for each node type.\n",
        "num_trains = int(num_nodes * 0.6)\n",
        "num_vals = int(num_nodes * 0.2)\n",
        "num_tests = num_nodes - num_trains - num_vals\n",
        "\n",
        "user_ids = np.arange(num_nodes)\n",
        "np.random.shuffle(user_ids)\n",
        "\n",
        "item_ids = np.arange(num_nodes)\n",
        "np.random.shuffle(item_ids)\n",
        "\n",
        "# Train IDs for user.\n",
        "nc_train_user_ids_path = os.path.join(base_dir, \"nc-train-user-ids.npy\")\n",
        "nc_train_user_ids = user_ids[:num_trains]\n",
        "print(f\"Part of train ids[user] for node classification: {nc_train_user_ids[:3]}\")\n",
        "np.save(nc_train_user_ids_path, nc_train_user_ids)\n",
        "print(f\"NC train ids[user] are saved to {nc_train_user_ids_path}\\n\")\n",
        "\n",
        "# Train labels for user.\n",
        "nc_train_user_labels_path = os.path.join(base_dir, \"nc-train-user-labels.pt\")\n",
        "nc_train_user_labels = torch.randint(0, 10, (num_trains,))\n",
        "print(f\"Part of train labels[user] for node classification: {nc_train_user_labels[:3]}\")\n",
        "torch.save(nc_train_user_labels, nc_train_user_labels_path)\n",
        "print(f\"NC train labels[user] are saved to {nc_train_user_labels_path}\\n\")\n",
        "\n",
        "# Train IDs for item.\n",
        "nc_train_item_ids_path = os.path.join(base_dir, \"nc-train-item-ids.npy\")\n",
        "nc_train_item_ids = item_ids[:num_trains]\n",
        "print(f\"Part of train ids[item] for node classification: {nc_train_item_ids[:3]}\")\n",
        "np.save(nc_train_item_ids_path, nc_train_item_ids)\n",
        "print(f\"NC train ids[item] are saved to {nc_train_item_ids_path}\\n\")\n",
        "\n",
        "# Train labels for item.\n",
        "nc_train_item_labels_path = os.path.join(base_dir, \"nc-train-item-labels.pt\")\n",
        "nc_train_item_labels = torch.randint(0, 10, (num_trains,))\n",
        "print(f\"Part of train labels[item] for node classification: {nc_train_item_labels[:3]}\")\n",
        "torch.save(nc_train_item_labels, nc_train_item_labels_path)\n",
        "print(f\"NC train labels[item] are saved to {nc_train_item_labels_path}\\n\")\n",
        "\n",
        "# Val IDs for user.\n",
        "nc_val_user_ids_path = os.path.join(base_dir, \"nc-val-user-ids.npy\")\n",
        "nc_val_user_ids = user_ids[num_trains:num_trains+num_vals]\n",
        "print(f\"Part of val ids[user] for node classification: {nc_val_user_ids[:3]}\")\n",
        "np.save(nc_val_user_ids_path, nc_val_user_ids)\n",
        "print(f\"NC val ids[user] are saved to {nc_val_user_ids_path}\\n\")\n",
        "\n",
        "# Val labels for user.\n",
        "nc_val_user_labels_path = os.path.join(base_dir, \"nc-val-user-labels.pt\")\n",
        "nc_val_user_labels = torch.randint(0, 10, (num_vals,))\n",
        "print(f\"Part of val labels[user] for node classification: {nc_val_user_labels[:3]}\")\n",
        "torch.save(nc_val_user_labels, nc_val_user_labels_path)\n",
        "print(f\"NC val labels[user] are saved to {nc_val_user_labels_path}\\n\")\n",
        "\n",
        "# Val IDs for item.\n",
        "nc_val_item_ids_path = os.path.join(base_dir, \"nc-val-item-ids.npy\")\n",
        "nc_val_item_ids = item_ids[num_trains:num_trains+num_vals]\n",
        "print(f\"Part of val ids[item] for node classification: {nc_val_item_ids[:3]}\")\n",
        "np.save(nc_val_item_ids_path, nc_val_item_ids)\n",
        "print(f\"NC val ids[item] are saved to {nc_val_item_ids_path}\\n\")\n",
        "\n",
        "# Val labels for item.\n",
        "nc_val_item_labels_path = os.path.join(base_dir, \"nc-val-item-labels.pt\")\n",
        "nc_val_item_labels = torch.randint(0, 10, (num_vals,))\n",
        "print(f\"Part of val labels[item] for node classification: {nc_val_item_labels[:3]}\")\n",
        "torch.save(nc_val_item_labels, nc_val_item_labels_path)\n",
        "print(f\"NC val labels[item] are saved to {nc_val_item_labels_path}\\n\")\n",
        "\n",
        "# Test IDs for user.\n",
        "nc_test_user_ids_path = os.path.join(base_dir, \"nc-test-user-ids.npy\")\n",
        "nc_test_user_ids = user_ids[-num_tests:]\n",
        "print(f\"Part of test ids[user] for node classification: {nc_test_user_ids[:3]}\")\n",
        "np.save(nc_test_user_ids_path, nc_test_user_ids)\n",
        "print(f\"NC test ids[user] are saved to {nc_test_user_ids_path}\\n\")\n",
        "\n",
        "# Test labels for user.\n",
        "nc_test_user_labels_path = os.path.join(base_dir, \"nc-test-user-labels.pt\")\n",
        "nc_test_user_labels = torch.randint(0, 10, (num_tests,))\n",
        "print(f\"Part of test labels[user] for node classification: {nc_test_user_labels[:3]}\")\n",
        "torch.save(nc_test_user_labels, nc_test_user_labels_path)\n",
        "print(f\"NC test labels[user] are saved to {nc_test_user_labels_path}\\n\")\n",
        "\n",
        "# Test IDs for item.\n",
        "nc_test_item_ids_path = os.path.join(base_dir, \"nc-test-item-ids.npy\")\n",
        "nc_test_item_ids = item_ids[-num_tests:]\n",
        "print(f\"Part of test ids[item] for node classification: {nc_test_item_ids[:3]}\")\n",
        "np.save(nc_test_item_ids_path, nc_test_item_ids)\n",
        "print(f\"NC test ids[item] are saved to {nc_test_item_ids_path}\\n\")\n",
        "\n",
        "# Test labels for item.\n",
        "nc_test_item_labels_path = os.path.join(base_dir, \"nc-test-item-labels.pt\")\n",
        "nc_test_item_labels = torch.randint(0, 10, (num_tests,))\n",
        "print(f\"Part of test labels[item] for node classification: {nc_test_item_labels[:3]}\")\n",
        "torch.save(nc_test_item_labels, nc_test_item_labels_path)\n",
        "print(f\"NC test labels[item] are saved to {nc_test_item_labels_path}\\n\")"
      ],
      "metadata": {
        "id": "S5-fyBbHzTCO"
      },
      "execution_count": null,
      "outputs": []
    },
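    {
      "cell_type": "markdown",
      "source": [
        "Before moving on, here is an optional consistency check (a minimal sketch, not part of the required pipeline): within each split, the ID array and label tensor are paired element-wise, so their lengths must match."
      ],
      "metadata": {}
    },
    {
      "cell_type": "code",
      "source": [
        "# Optional sanity check: each split's IDs and labels must have equal length,\n",
        "# because they are zipped together into one item set per node type.\n",
        "for split, ids_path, labels_path in [\n",
        "    (\"train\", nc_train_user_ids_path, nc_train_user_labels_path),\n",
        "    (\"val\", nc_val_user_ids_path, nc_val_user_labels_path),\n",
        "    (\"test\", nc_test_user_ids_path, nc_test_user_labels_path),\n",
        "]:\n",
        "    ids = np.load(ids_path)\n",
        "    labels = torch.load(labels_path)\n",
        "    assert len(ids) == len(labels), f\"{split}: {len(ids)} != {len(labels)}\"\n",
        "    print(f\"user {split} set: {len(ids)} IDs, {len(labels)} labels\")"
      ],
      "metadata": {},
      "execution_count": null,
      "outputs": []
    },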
    {
      "cell_type": "markdown",
      "source": [
        "#### Link Prediction Task\n",
        "For link prediction task, we need **node pairs** or **negative src/dsts** for each training/validation/test set. Like feature data, numpy arrays and torch tensors are supported for these sets."
      ],
      "metadata": {
        "id": "LhAcDCHQ_KJ0"
      }
    },
    {
      "cell_type": "code",
      "source": [
        "# For illustration, let's generate item sets for each edge type.\n",
        "num_trains = int(num_edges * 0.6)\n",
        "num_vals = int(num_edges * 0.2)\n",
        "num_tests = num_edges - num_trains - num_vals\n",
        "\n",
        "# Train node pairs for user:like:item.\n",
        "lp_train_like_node_pairs_path = os.path.join(base_dir, \"lp-train-like-node-pairs.npy\")\n",
        "lp_train_like_node_pairs = like_edges[:num_trains, :]\n",
        "print(f\"Part of train node pairs[user:like:item] for link prediction: {lp_train_like_node_pairs[:3]}\")\n",
        "np.save(lp_train_like_node_pairs_path, lp_train_like_node_pairs)\n",
        "print(f\"LP train node pairs[user:like:item] are saved to {lp_train_like_node_pairs_path}\\n\")\n",
        "\n",
        "# Train node pairs for user:follow:user.\n",
        "lp_train_follow_node_pairs_path = os.path.join(base_dir, \"lp-train-follow-node-pairs.npy\")\n",
        "lp_train_follow_node_pairs = follow_edges[:num_trains, :]\n",
        "print(f\"Part of train node pairs[user:follow:user] for link prediction: {lp_train_follow_node_pairs[:3]}\")\n",
        "np.save(lp_train_follow_node_pairs_path, lp_train_follow_node_pairs)\n",
        "print(f\"LP train node pairs[user:follow:user] are saved to {lp_train_follow_node_pairs_path}\\n\")\n",
        "\n",
        "# Val node pairs for user:like:item.\n",
        "lp_val_like_node_pairs_path = os.path.join(base_dir, \"lp-val-like-node-pairs.npy\")\n",
        "lp_val_like_node_pairs = like_edges[num_trains:num_trains+num_vals, :]\n",
        "print(f\"Part of val node pairs[user:like:item] for link prediction: {lp_val_like_node_pairs[:3]}\")\n",
        "np.save(lp_val_like_node_pairs_path, lp_val_like_node_pairs)\n",
        "print(f\"LP val node pairs[user:like:item] are saved to {lp_val_like_node_pairs_path}\\n\")\n",
        "\n",
        "# Val negative dsts for user:like:item.\n",
        "lp_val_like_neg_dsts_path = os.path.join(base_dir, \"lp-val-like-neg-dsts.pt\")\n",
        "lp_val_like_neg_dsts = torch.randint(0, num_nodes, (num_vals, 10))\n",
        "print(f\"Part of val negative dsts[user:like:item] for link prediction: {lp_val_like_neg_dsts[:3]}\")\n",
        "torch.save(lp_val_like_neg_dsts, lp_val_like_neg_dsts_path)\n",
        "print(f\"LP val negative dsts[user:like:item] are saved to {lp_val_like_neg_dsts_path}\\n\")\n",
        "\n",
        "# Val node pairs for user:follow:user.\n",
        "lp_val_follow_node_pairs_path = os.path.join(base_dir, \"lp-val-follow-node-pairs.npy\")\n",
        "lp_val_follow_node_pairs = follow_edges[num_trains:num_trains+num_vals, :]\n",
        "print(f\"Part of val node pairs[user:follow:user] for link prediction: {lp_val_follow_node_pairs[:3]}\")\n",
        "np.save(lp_val_follow_node_pairs_path, lp_val_follow_node_pairs)\n",
        "print(f\"LP val node pairs[user:follow:user] are saved to {lp_val_follow_node_pairs_path}\\n\")\n",
        "\n",
        "# Val negative dsts for user:follow:user.\n",
        "lp_val_follow_neg_dsts_path = os.path.join(base_dir, \"lp-val-follow-neg-dsts.pt\")\n",
        "lp_val_follow_neg_dsts = torch.randint(0, num_nodes, (num_vals, 10))\n",
        "print(f\"Part of val negative dsts[user:follow:user] for link prediction: {lp_val_follow_neg_dsts[:3]}\")\n",
        "torch.save(lp_val_follow_neg_dsts, lp_val_follow_neg_dsts_path)\n",
        "print(f\"LP val negative dsts[user:follow:user] are saved to {lp_val_follow_neg_dsts_path}\\n\")\n",
        "\n",
        "# Test node paris for user:like:item.\n",
        "lp_test_like_node_pairs_path = os.path.join(base_dir, \"lp-test-like-node-pairs.npy\")\n",
        "lp_test_like_node_pairs = like_edges[-num_tests, :]\n",
        "print(f\"Part of test node pairs[user:like:item] for link prediction: {lp_test_like_node_pairs[:3]}\")\n",
        "np.save(lp_test_like_node_pairs_path, lp_test_like_node_pairs)\n",
        "print(f\"LP test node pairs[user:like:item] are saved to {lp_test_like_node_pairs_path}\\n\")\n",
        "\n",
        "# Test negative dsts for user:like:item.\n",
        "lp_test_like_neg_dsts_path = os.path.join(base_dir, \"lp-test-like-neg-dsts.pt\")\n",
        "lp_test_like_neg_dsts = torch.randint(0, num_nodes, (num_tests, 10))\n",
        "print(f\"Part of test negative dsts[user:like:item] for link prediction: {lp_test_like_neg_dsts[:3]}\")\n",
        "torch.save(lp_test_like_neg_dsts, lp_test_like_neg_dsts_path)\n",
        "print(f\"LP test negative dsts[user:like:item] are saved to {lp_test_like_neg_dsts_path}\\n\")\n",
        "\n",
        "# Test node paris for user:follow:user.\n",
        "lp_test_follow_node_pairs_path = os.path.join(base_dir, \"lp-test-follow-node-pairs.npy\")\n",
        "lp_test_follow_node_pairs = follow_edges[-num_tests, :]\n",
        "print(f\"Part of test node pairs[user:follow:user] for link prediction: {lp_test_follow_node_pairs[:3]}\")\n",
        "np.save(lp_test_follow_node_pairs_path, lp_test_follow_node_pairs)\n",
        "print(f\"LP test node pairs[user:follow:user] are saved to {lp_test_follow_node_pairs_path}\\n\")\n",
        "\n",
        "# Test negative dsts for user:follow:user.\n",
        "lp_test_follow_neg_dsts_path = os.path.join(base_dir, \"lp-test-follow-neg-dsts.pt\")\n",
        "lp_test_follow_neg_dsts = torch.randint(0, num_nodes, (num_tests, 10))\n",
        "print(f\"Part of test negative dsts[user:follow:user] for link prediction: {lp_test_follow_neg_dsts[:3]}\")\n",
        "torch.save(lp_test_follow_neg_dsts, lp_test_follow_neg_dsts_path)\n",
        "print(f\"LP test negative dsts[user:follow:user] are saved to {lp_test_follow_neg_dsts_path}\\n\")"
      ],
      "metadata": {
        "id": "u0jCnXIcAQy4"
      },
      "execution_count": null,
      "outputs": []
    },
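    {
      "cell_type": "markdown",
      "source": [
        "Again, an optional shape check (a minimal sketch): node pairs must be 2-D arrays of shape `(N, 2)`, and negative destinations must align row-wise with the node pairs of the same split."
      ],
      "metadata": {}
    },
    {
      "cell_type": "code",
      "source": [
        "# Optional sanity check: test node pairs should be (num_tests, 2) and each\n",
        "# row of negative dsts corresponds to one positive node pair.\n",
        "test_pairs = np.load(lp_test_like_node_pairs_path)\n",
        "test_neg_dsts = torch.load(lp_test_like_neg_dsts_path)\n",
        "assert test_pairs.shape == (num_tests, 2)\n",
        "assert test_neg_dsts.shape[0] == num_tests\n",
        "print(f\"test[user:like:item]: pairs {test_pairs.shape}, \"\n",
        "      f\"neg dsts {tuple(test_neg_dsts.shape)}\")"
      ],
      "metadata": {},
      "execution_count": null,
      "outputs": []
    },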
    {
      "cell_type": "markdown",
      "source": [
        "## Organize Data into YAML File\n",
        "Now we need to create a `metadata.yaml` file which contains the paths, dadta types of graph structure, feature data, training/validation/test sets. Please note that all path should be relative to `metadata.yaml`.\n",
        "\n",
        "For heterogeneous graph, we need to specify the node/edge type in **type** fields. For edge type, canonical etype is required which is a string that's concatenated by source node type, etype, and destination node type together with `:`.\n",
        "\n",
        "Notes:\n",
        "- all path should be relative to `metadata.yaml`.\n",
        "- Below fields are optional and not specified in below example.\n",
        "  - `in_memory`: indicates whether to load dada into memory or `mmap`. Default is `True`.\n",
        "\n",
        "Please refer to [YAML specification](https://github.com/dmlc/dgl/blob/master/docs/source/stochastic_training/ondisk-dataset-specification.rst) for more details."
      ],
      "metadata": {
        "id": "wbk6-wxRK-6S"
      }
    },
    {
      "cell_type": "code",
      "source": [
        "yaml_content = f\"\"\"\n",
        "    dataset_name: heterogeneous_graph_nc_lp\n",
        "    graph:\n",
        "      nodes:\n",
        "        - type: user\n",
        "          num: {num_nodes}\n",
        "        - type: item\n",
        "          num: {num_nodes}\n",
        "      edges:\n",
        "        - type: \"user:like:item\"\n",
        "          format: csv\n",
        "          path: {os.path.basename(like_edges_path)}\n",
        "        - type: \"user:follow:user\"\n",
        "          format: csv\n",
        "          path: {os.path.basename(follow_edges_path)}\n",
        "    feature_data:\n",
        "      - domain: node\n",
        "        type: user\n",
        "        name: feat_0\n",
        "        format: numpy\n",
        "        path: {os.path.basename(node_user_feat_0_path)}\n",
        "      - domain: node\n",
        "        type: user\n",
        "        name: feat_1\n",
        "        format: torch\n",
        "        path: {os.path.basename(node_user_feat_1_path)}\n",
        "      - domain: node\n",
        "        type: item\n",
        "        name: feat_0\n",
        "        format: numpy\n",
        "        path: {os.path.basename(node_item_feat_0_path)}\n",
        "      - domain: node\n",
        "        type: item\n",
        "        name: feat_1\n",
        "        format: torch\n",
        "        path: {os.path.basename(node_item_feat_1_path)}\n",
        "      - domain: edge\n",
        "        type: \"user:like:item\"\n",
        "        name: feat_0\n",
        "        format: numpy\n",
        "        path: {os.path.basename(edge_like_feat_0_path)}\n",
        "      - domain: edge\n",
        "        type: \"user:like:item\"\n",
        "        name: feat_1\n",
        "        format: torch\n",
        "        path: {os.path.basename(edge_like_feat_1_path)}\n",
        "      - domain: edge\n",
        "        type: \"user:follow:user\"\n",
        "        name: feat_0\n",
        "        format: numpy\n",
        "        path: {os.path.basename(edge_follow_feat_0_path)}\n",
        "      - domain: edge\n",
        "        type: \"user:follow:user\"\n",
        "        name: feat_1\n",
        "        format: torch\n",
        "        path: {os.path.basename(edge_follow_feat_1_path)}\n",
        "    tasks:\n",
        "      - name: node_classification\n",
        "        num_classes: 10\n",
        "        train_set:\n",
        "          - type: user\n",
        "            data:\n",
        "              - name: seed_nodes\n",
        "                format: numpy\n",
        "                path: {os.path.basename(nc_train_user_ids_path)}\n",
        "              - name: labels\n",
        "                format: torch\n",
        "                path: {os.path.basename(nc_train_user_labels_path)}\n",
        "          - type: item\n",
        "            data:\n",
        "              - name: seed_nodes\n",
        "                format: numpy\n",
        "                path: {os.path.basename(nc_train_item_ids_path)}\n",
        "              - name: labels\n",
        "                format: torch\n",
        "                path: {os.path.basename(nc_train_item_labels_path)}\n",
        "        validation_set:\n",
        "          - type: user\n",
        "            data:\n",
        "              - name: seed_nodes\n",
        "                format: numpy\n",
        "                path: {os.path.basename(nc_val_user_ids_path)}\n",
        "              - name: labels\n",
        "                format: torch\n",
        "                path: {os.path.basename(nc_val_user_labels_path)}\n",
        "          - type: item\n",
        "            data:\n",
        "              - name: seed_nodes\n",
        "                format: numpy\n",
        "                path: {os.path.basename(nc_val_item_ids_path)}\n",
        "              - name: labels\n",
        "                format: torch\n",
        "                path: {os.path.basename(nc_val_item_labels_path)}\n",
        "        test_set:\n",
        "          - type: user\n",
        "            data:\n",
        "              - name: seed_nodes\n",
        "                format: numpy\n",
        "                path: {os.path.basename(nc_test_user_ids_path)}\n",
        "              - name: labels\n",
        "                format: torch\n",
        "                path: {os.path.basename(nc_test_user_labels_path)}\n",
        "          - type: item\n",
        "            data:\n",
        "              - name: seed_nodes\n",
        "                format: numpy\n",
        "                path: {os.path.basename(nc_test_item_ids_path)}\n",
        "              - name: labels\n",
        "                format: torch\n",
        "                path: {os.path.basename(nc_test_item_labels_path)}\n",
        "      - name: link_prediction\n",
        "        num_classes: 10\n",
        "        train_set:\n",
        "          - type: \"user:like:item\"\n",
        "            data:\n",
        "              - name: node_pairs\n",
        "                format: numpy\n",
        "                path: {os.path.basename(lp_train_like_node_pairs_path)}\n",
        "          - type: \"user:follow:user\"\n",
        "            data:\n",
        "              - name: node_pairs\n",
        "                format: numpy\n",
        "                path: {os.path.basename(lp_train_follow_node_pairs_path)}\n",
        "        validation_set:\n",
        "          - type: \"user:like:item\"\n",
        "            data:\n",
        "              - name: node_pairs\n",
        "                format: numpy\n",
        "                path: {os.path.basename(lp_val_like_node_pairs_path)}\n",
        "              - name: negative_dsts\n",
        "                format: torch\n",
        "                path: {os.path.basename(lp_val_like_neg_dsts_path)}\n",
        "          - type: \"user:follow:user\"\n",
        "            data:\n",
        "              - name: node_pairs\n",
        "                format: numpy\n",
        "                path: {os.path.basename(lp_val_follow_node_pairs_path)}\n",
        "              - name: negative_dsts\n",
        "                format: torch\n",
        "                path: {os.path.basename(lp_val_follow_neg_dsts_path)}\n",
        "        test_set:\n",
        "          - type: \"user:like:item\"\n",
        "            data:\n",
        "              - name: node_pairs\n",
        "                format: numpy\n",
        "                path: {os.path.basename(lp_test_like_node_pairs_path)}\n",
        "              - name: negative_dsts\n",
        "                format: torch\n",
        "                path: {os.path.basename(lp_test_like_neg_dsts_path)}\n",
        "          - type: \"user:follow:user\"\n",
        "            data:\n",
        "              - name: node_pairs\n",
        "                format: numpy\n",
        "                path: {os.path.basename(lp_test_follow_node_pairs_path)}\n",
        "              - name: negative_dsts\n",
        "                format: torch\n",
        "                path: {os.path.basename(lp_test_follow_neg_dsts_path)}\n",
        "\"\"\"\n",
        "metadata_path = os.path.join(base_dir, \"metadata.yaml\")\n",
        "with open(metadata_path, \"w\") as f:\n",
        "  f.write(yaml_content)"
      ],
      "metadata": {
        "id": "ddGTWW61Lpwp"
      },
      "execution_count": null,
      "outputs": []
    },
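    {
      "cell_type": "markdown",
      "source": [
        "Before instantiating the dataset, we can optionally parse the generated file back with PyYAML (available in Colab) to confirm it is well-formed and inspect its top-level structure."
      ],
      "metadata": {}
    },
    {
      "cell_type": "code",
      "source": [
        "# Optional sanity check: parse the generated YAML and print its structure.\n",
        "import yaml\n",
        "\n",
        "with open(metadata_path) as f:\n",
        "    metadata = yaml.safe_load(f)\n",
        "print(f\"Dataset name: {metadata['dataset_name']}\")\n",
        "print(f\"Node types: {[node['type'] for node in metadata['graph']['nodes']]}\")\n",
        "print(f\"Edge types: {[edge['type'] for edge in metadata['graph']['edges']]}\")\n",
        "print(f\"Tasks: {[task['name'] for task in metadata['tasks']]}\")"
      ],
      "metadata": {},
      "execution_count": null,
      "outputs": []
    },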
    {
      "cell_type": "markdown",
      "source": [
        "## Instantiate `OnDiskDataset`\n",
        "Now we're ready to load dataset via `dgl.graphbolt.OnDiskDataset`. When instantiating, we just pass in the base directory where `metadata.yaml` file lies.\n",
        "\n",
        "During first instantiation, GraphBolt preprocesses the raw data such as constructing `FusedCSCSamplingGraph` from edges. All data including graph, feature data, training/validation/test sets are put into `preprocessed` directory after preprocessing. Any following dataset loading will skip the preprocess stage.\n",
        "\n",
        "After preprocessing, `load()` is required to be called explicitly in order to load graph, feature data and tasks."
      ],
      "metadata": {
        "id": "kEfybHGhOW7O"
      }
    },
    {
      "cell_type": "code",
      "source": [
        "dataset = gb.OnDiskDataset(base_dir).load()\n",
        "graph = dataset.graph\n",
        "print(f\"Loaded graph: {graph}\\n\")\n",
        "\n",
        "feature = dataset.feature\n",
        "print(f\"Loaded feature store: {feature}\\n\")\n",
        "\n",
        "tasks = dataset.tasks\n",
        "nc_task = tasks[0]\n",
        "print(f\"Loaded node classification task: {nc_task}\\n\")\n",
        "lp_task = tasks[1]\n",
        "print(f\"Loaded link prediction task: {lp_task}\\n\")"
      ],
      "metadata": {
        "id": "W58CZoSzOiyo"
      },
      "execution_count": null,
      "outputs": []
    },
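    {
      "cell_type": "markdown",
      "source": [
        "As a final illustration, the cell below sketches how such a dataset is typically consumed for sampling-based training. This is a minimal, hedged example: it assumes the standard GraphBolt datapipe APIs (`gb.ItemSampler`, `sample_neighbor`, `fetch_feature`, `gb.DataLoader`), and the batch size, fanouts, and feature keys are arbitrary choices for demonstration."
      ],
      "metadata": {}
    },
    {
      "cell_type": "code",
      "source": [
        "# A minimal sketch of consuming the dataset; batch size, fanouts, and\n",
        "# feature keys below are illustrative choices, not recommendations.\n",
        "datapipe = gb.ItemSampler(nc_task.train_set, batch_size=128, shuffle=True)\n",
        "# Sample a 2-hop neighborhood around each seed node.\n",
        "datapipe = datapipe.sample_neighbor(graph, [4, 4])\n",
        "# Attach on-disk features of both node types to each mini-batch.\n",
        "datapipe = datapipe.fetch_feature(\n",
        "    feature, node_feature_keys={\"user\": [\"feat_0\"], \"item\": [\"feat_0\"]}\n",
        ")\n",
        "dataloader = gb.DataLoader(datapipe)\n",
        "minibatch = next(iter(dataloader))\n",
        "print(minibatch)"
      ],
      "metadata": {},
      "execution_count": null,
      "outputs": []
    }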
  ]
}