Unverified Commit e847fc44 authored by yxy235, committed by GitHub

[GraphBolt] Update notebooks about how to construct ondisk datasets. (#7268)


Co-authored-by: Ubuntu <ubuntu@ip-172-31-0-133.us-west-2.compute.internal>
parent 78df8101
{
"nbformat": 4,
"nbformat_minor": 0,
"metadata": {
"colab": {
"private_outputs": true,
"provenance": []
},
"kernelspec": {
"name": "python3",
"display_name": "Python 3"
},
"language_info": {
"name": "python"
}
},
"cells": [
{
"cell_type": "markdown",
"metadata": {
"id": "FnFhPMaAfLtJ"
},
"source": [
"# OnDiskDataset for Heterogeneous Graph\n",
"\n",
......@@ -33,22 +21,24 @@
"To create an ``OnDiskDataset`` object, you need to organize all the data including graph structure, feature data and tasks into a directory. The directory should contain a ``metadata.yaml`` file that describes the metadata of the dataset.\n",
"\n",
"Now let's generate various data step by step and organize them together to instantiate `OnDiskDataset` finally."
]
},
{
"cell_type": "markdown",
"source": [
"## Install DGL package"
],
"metadata": {
"id": "Wlb19DtWgtzq"
}
},
"source": [
"## Install DGL package"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"id": "UojlT9ZGgyr9"
},
"outputs": [],
"source": [
"# Install required packages.\n",
"import os\n",
......@@ -69,52 +59,52 @@
" installed = False\n",
" print(error)\n",
"print(\"DGL installed!\" if installed else \"DGL not found!\")"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "2R7WnSbjsfbr"
},
"source": [
"## Data preparation\n",
"In order to demonstrate how to organize various data, let's create a base directory first."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"id": "SZipbzyltLfO"
},
"outputs": [],
"source": [
"base_dir = './ondisk_dataset_heterograph'\n",
"os.makedirs(base_dir, exist_ok=True)\n",
"print(f\"Created base directory: {base_dir}\")"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "qhNtIn_xhlnl"
},
"source": [
"### Generate graph structure data\n",
"For heterogeneous graph, we need to save different edge edges(namely node pairs) into separate **Numpy** or **CSV** files.\n",
"For heterogeneous graph, we need to save different edge edges(namely seeds) into separate **Numpy** or **CSV** files.\n",
"\n",
"Note:\n",
"- when saving to **Numpy**, the array requires to be in shape of `(2, N)`. This format is recommended as constructing graph from it is much faster than **CSV** file.\n",
"- when saving to **CSV** file, do not save index and header.\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"id": "HcBt4G5BmSjr"
},
"outputs": [],
"source": [
"import numpy as np\n",
"import pandas as pd\n",
......@@ -143,25 +133,25 @@
"df = pd.DataFrame(follow_edges)\n",
"df.to_csv(follow_edges_path, index=False, header=False)\n",
"print(f\"[user:follow:user] edges are saved into {follow_edges_path}\\n\")"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "kh-4cPtzpcaH"
},
"source": [
"### Generate feature data for graph\n",
"For feature data, numpy arrays and torch tensors are supported for now. Let's generate feature data for each node/edge type."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"id": "_PVu1u5brBhF"
},
"outputs": [],
"source": [
"# Generate node[user] feature in numpy array.\n",
"node_user_feat_0_path = os.path.join(base_dir, \"node-user-feat-0.npy\")\n",
......@@ -218,35 +208,35 @@
"print(f\"Part of edge[user:follow:user] feature [feat_1]: {edge_follow_feat_1[:3, :]}\")\n",
"torch.save(edge_follow_feat_1, edge_follow_feat_1_path)\n",
"print(f\"Edge[user:follow:user] feature [feat_1] is saved to {edge_follow_feat_1_path}\\n\")"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "ZyqgOtsIwzh_"
},
"source": [
"### Generate tasks\n",
"`OnDiskDataset` supports multiple tasks. For each task, we need to prepare training/validation/test sets respectively. Such sets usually vary among different tasks. In this tutorial, let's create a **Node Classification** task and **Link Prediction** task."
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "hVxHaDIfzCkr"
},
"source": [
"#### Node Classification Task\n",
"For node classification task, we need **node IDs** and corresponding **labels** for each training/validation/test set. Like feature data, numpy arrays and torch tensors are supported for these sets."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"id": "S5-fyBbHzTCO"
},
"outputs": [],
"source": [
"# For illustration, let's generate item sets for each node type.\n",
"num_trains = int(num_nodes * 0.6)\n",
......@@ -342,109 +332,167 @@
"print(f\"Part of test labels[item] for node classification: {nc_test_item_labels[:3]}\")\n",
"torch.save(nc_test_item_labels, nc_test_item_labels_path)\n",
"print(f\"NC test labels[item] are saved to {nc_test_item_labels_path}\\n\")"
]
},
{
"cell_type": "markdown",
"source": [
"#### Link Prediction Task\n",
"For link prediction task, we need **node pairs** or **negative src/dsts** for each training/validation/test set. Like feature data, numpy arrays and torch tensors are supported for these sets."
],
"metadata": {
"id": "LhAcDCHQ_KJ0"
}
},
"source": [
"#### Link Prediction Task\n",
"For link prediction task, we need **seeds** or **corresponding labels and indexes** which representing the pos/neg property and group of the seeds for each training/validation/test set. Like feature data, numpy arrays and torch tensors are supported for these sets."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"id": "u0jCnXIcAQy4"
},
"outputs": [],
"source": [
"# For illustration, let's generate item sets for each edge type.\n",
"num_trains = int(num_edges * 0.6)\n",
"num_vals = int(num_edges * 0.2)\n",
"num_tests = num_edges - num_trains - num_vals\n",
"\n",
"# Train node pairs for user:like:item.\n",
"lp_train_like_node_pairs_path = os.path.join(base_dir, \"lp-train-like-node-pairs.npy\")\n",
"lp_train_like_node_pairs = like_edges[:num_trains, :]\n",
"print(f\"Part of train node pairs[user:like:item] for link prediction: {lp_train_like_node_pairs[:3]}\")\n",
"np.save(lp_train_like_node_pairs_path, lp_train_like_node_pairs)\n",
"print(f\"LP train node pairs[user:like:item] are saved to {lp_train_like_node_pairs_path}\\n\")\n",
"\n",
"# Train node pairs for user:follow:user.\n",
"lp_train_follow_node_pairs_path = os.path.join(base_dir, \"lp-train-follow-node-pairs.npy\")\n",
"lp_train_follow_node_pairs = follow_edges[:num_trains, :]\n",
"print(f\"Part of train node pairs[user:follow:user] for link prediction: {lp_train_follow_node_pairs[:3]}\")\n",
"np.save(lp_train_follow_node_pairs_path, lp_train_follow_node_pairs)\n",
"print(f\"LP train node pairs[user:follow:user] are saved to {lp_train_follow_node_pairs_path}\\n\")\n",
"\n",
"# Val node pairs for user:like:item.\n",
"lp_val_like_node_pairs_path = os.path.join(base_dir, \"lp-val-like-node-pairs.npy\")\n",
"lp_val_like_node_pairs = like_edges[num_trains:num_trains+num_vals, :]\n",
"print(f\"Part of val node pairs[user:like:item] for link prediction: {lp_val_like_node_pairs[:3]}\")\n",
"np.save(lp_val_like_node_pairs_path, lp_val_like_node_pairs)\n",
"print(f\"LP val node pairs[user:like:item] are saved to {lp_val_like_node_pairs_path}\\n\")\n",
"\n",
"# Val negative dsts for user:like:item.\n",
"lp_val_like_neg_dsts_path = os.path.join(base_dir, \"lp-val-like-neg-dsts.pt\")\n",
"lp_val_like_neg_dsts = torch.randint(0, num_nodes, (num_vals, 10))\n",
"print(f\"Part of val negative dsts[user:like:item] for link prediction: {lp_val_like_neg_dsts[:3]}\")\n",
"torch.save(lp_val_like_neg_dsts, lp_val_like_neg_dsts_path)\n",
"print(f\"LP val negative dsts[user:like:item] are saved to {lp_val_like_neg_dsts_path}\\n\")\n",
"\n",
"# Val node pairs for user:follow:user.\n",
"lp_val_follow_node_pairs_path = os.path.join(base_dir, \"lp-val-follow-node-pairs.npy\")\n",
"lp_val_follow_node_pairs = follow_edges[num_trains:num_trains+num_vals, :]\n",
"print(f\"Part of val node pairs[user:follow:user] for link prediction: {lp_val_follow_node_pairs[:3]}\")\n",
"np.save(lp_val_follow_node_pairs_path, lp_val_follow_node_pairs)\n",
"print(f\"LP val node pairs[user:follow:user] are saved to {lp_val_follow_node_pairs_path}\\n\")\n",
"\n",
"# Val negative dsts for user:follow:user.\n",
"lp_val_follow_neg_dsts_path = os.path.join(base_dir, \"lp-val-follow-neg-dsts.pt\")\n",
"lp_val_follow_neg_dsts = torch.randint(0, num_nodes, (num_vals, 10))\n",
"print(f\"Part of val negative dsts[user:follow:user] for link prediction: {lp_val_follow_neg_dsts[:3]}\")\n",
"torch.save(lp_val_follow_neg_dsts, lp_val_follow_neg_dsts_path)\n",
"print(f\"LP val negative dsts[user:follow:user] are saved to {lp_val_follow_neg_dsts_path}\\n\")\n",
"\n",
"# Test node paris for user:like:item.\n",
"lp_test_like_node_pairs_path = os.path.join(base_dir, \"lp-test-like-node-pairs.npy\")\n",
"lp_test_like_node_pairs = like_edges[-num_tests:, :]\n",
"print(f\"Part of test node pairs[user:like:item] for link prediction: {lp_test_like_node_pairs[:3]}\")\n",
"np.save(lp_test_like_node_pairs_path, lp_test_like_node_pairs)\n",
"print(f\"LP test node pairs[user:like:item] are saved to {lp_test_like_node_pairs_path}\\n\")\n",
"\n",
"# Test negative dsts for user:like:item.\n",
"lp_test_like_neg_dsts_path = os.path.join(base_dir, \"lp-test-like-neg-dsts.pt\")\n",
"lp_test_like_neg_dsts = torch.randint(0, num_nodes, (num_tests, 10))\n",
"print(f\"Part of test negative dsts[user:like:item] for link prediction: {lp_test_like_neg_dsts[:3]}\")\n",
"torch.save(lp_test_like_neg_dsts, lp_test_like_neg_dsts_path)\n",
"print(f\"LP test negative dsts[user:like:item] are saved to {lp_test_like_neg_dsts_path}\\n\")\n",
"\n",
"# Test node paris for user:follow:user.\n",
"lp_test_follow_node_pairs_path = os.path.join(base_dir, \"lp-test-follow-node-pairs.npy\")\n",
"lp_test_follow_node_pairs = follow_edges[-num_tests:, :]\n",
"print(f\"Part of test node pairs[user:follow:user] for link prediction: {lp_test_follow_node_pairs[:3]}\")\n",
"np.save(lp_test_follow_node_pairs_path, lp_test_follow_node_pairs)\n",
"print(f\"LP test node pairs[user:follow:user] are saved to {lp_test_follow_node_pairs_path}\\n\")\n",
"\n",
"# Test negative dsts for user:follow:user.\n",
"lp_test_follow_neg_dsts_path = os.path.join(base_dir, \"lp-test-follow-neg-dsts.pt\")\n",
"lp_test_follow_neg_dsts = torch.randint(0, num_nodes, (num_tests, 10))\n",
"print(f\"Part of test negative dsts[user:follow:user] for link prediction: {lp_test_follow_neg_dsts[:3]}\")\n",
"torch.save(lp_test_follow_neg_dsts, lp_test_follow_neg_dsts_path)\n",
"print(f\"LP test negative dsts[user:follow:user] are saved to {lp_test_follow_neg_dsts_path}\\n\")"
],
"metadata": {
"id": "u0jCnXIcAQy4"
},
"execution_count": null,
"outputs": []
"# Train seeds for user:like:item.\n",
"lp_train_like_seeds_path = os.path.join(base_dir, \"lp-train-like-seeds.npy\")\n",
"lp_train_like_seeds = like_edges[:num_trains, :]\n",
"print(f\"Part of train seeds[user:like:item] for link prediction: {lp_train_like_seeds[:3]}\")\n",
"np.save(lp_train_like_seeds_path, lp_train_like_seeds)\n",
"print(f\"LP train seeds[user:like:item] are saved to {lp_train_like_seeds_path}\\n\")\n",
"\n",
"# Train seeds for user:follow:user.\n",
"lp_train_follow_seeds_path = os.path.join(base_dir, \"lp-train-follow-seeds.npy\")\n",
"lp_train_follow_seeds = follow_edges[:num_trains, :]\n",
"print(f\"Part of train seeds[user:follow:user] for link prediction: {lp_train_follow_seeds[:3]}\")\n",
"np.save(lp_train_follow_seeds_path, lp_train_follow_seeds)\n",
"print(f\"LP train seeds[user:follow:user] are saved to {lp_train_follow_seeds_path}\\n\")\n",
"\n",
"# Val seeds for user:like:item.\n",
"lp_val_like_seeds_path = os.path.join(base_dir, \"lp-val-like-seeds.npy\")\n",
"lp_val_like_seeds = like_edges[num_trains:num_trains+num_vals, :]\n",
"lp_val_like_neg_dsts = np.random.randint(0, num_nodes, (num_vals, 10)).reshape(-1)\n",
"lp_val_like_neg_srcs = np.repeat(lp_val_like_seeds[:,0], 10)\n",
"lp_val_like_neg_seeds = np.concatenate((lp_val_like_neg_srcs, lp_val_like_neg_dsts)).reshape(2,-1).T\n",
"lp_val_like_seeds = np.concatenate((lp_val_like_seeds, lp_val_like_neg_seeds))\n",
"print(f\"Part of val seeds[user:like:item] for link prediction: {lp_val_like_seeds[:3]}\")\n",
"np.save(lp_val_like_seeds_path, lp_val_like_seeds)\n",
"print(f\"LP val seeds[user:like:item] are saved to {lp_val_like_seeds_path}\\n\")\n",
"\n",
"# Val labels for user:like:item.\n",
"lp_val_like_labels_path = os.path.join(base_dir, \"lp-val-like-labels.npy\")\n",
"lp_val_like_labels = np.empty(num_vals * (10 + 1))\n",
"lp_val_like_labels[:num_vals] = 1\n",
"lp_val_like_labels[num_vals:] = 0\n",
"print(f\"Part of val labels[user:like:item] for link prediction: {lp_val_like_labels[:3]}\")\n",
"np.save(lp_val_like_labels_path, lp_val_like_labels)\n",
"print(f\"LP val labels[user:like:item] are saved to {lp_val_like_labels_path}\\n\")\n",
"\n",
"# Val indexes for user:like:item.\n",
"lp_val_like_indexes_path = os.path.join(base_dir, \"lp-val-like-indexes.npy\")\n",
"lp_val_like_indexes = np.arange(0, num_vals)\n",
"lp_val_like_neg_indexes = np.repeat(lp_val_like_indexes, 10)\n",
"lp_val_like_indexes = np.concatenate([lp_val_like_indexes, lp_val_like_neg_indexes])\n",
"print(f\"Part of val indexes[user:like:item] for link prediction: {lp_val_like_indexes[:3]}\")\n",
"np.save(lp_val_like_indexes_path, lp_val_like_indexes)\n",
"print(f\"LP val indexes[user:like:item] are saved to {lp_val_like_indexes_path}\\n\")\n",
"\n",
"# Val seeds for user:follow:item.\n",
"lp_val_follow_seeds_path = os.path.join(base_dir, \"lp-val-follow-seeds.npy\")\n",
"lp_val_follow_seeds = follow_edges[num_trains:num_trains+num_vals, :]\n",
"lp_val_follow_neg_dsts = np.random.randint(0, num_nodes, (num_vals, 10)).reshape(-1)\n",
"lp_val_follow_neg_srcs = np.repeat(lp_val_follow_seeds[:,0], 10)\n",
"lp_val_follow_neg_seeds = np.concatenate((lp_val_follow_neg_srcs, lp_val_follow_neg_dsts)).reshape(2,-1).T\n",
"lp_val_follow_seeds = np.concatenate((lp_val_follow_seeds, lp_val_follow_neg_seeds))\n",
"print(f\"Part of val seeds[user:follow:item] for link prediction: {lp_val_follow_seeds[:3]}\")\n",
"np.save(lp_val_follow_seeds_path, lp_val_follow_seeds)\n",
"print(f\"LP val seeds[user:follow:item] are saved to {lp_val_follow_seeds_path}\\n\")\n",
"\n",
"# Val labels for user:follow:item.\n",
"lp_val_follow_labels_path = os.path.join(base_dir, \"lp-val-follow-labels.npy\")\n",
"lp_val_follow_labels = np.empty(num_vals * (10 + 1))\n",
"lp_val_follow_labels[:num_vals] = 1\n",
"lp_val_follow_labels[num_vals:] = 0\n",
"print(f\"Part of val labels[user:follow:item] for link prediction: {lp_val_follow_labels[:3]}\")\n",
"np.save(lp_val_follow_labels_path, lp_val_follow_labels)\n",
"print(f\"LP val labels[user:follow:item] are saved to {lp_val_follow_labels_path}\\n\")\n",
"\n",
"# Val indexes for user:follow:item.\n",
"lp_val_follow_indexes_path = os.path.join(base_dir, \"lp-val-follow-indexes.npy\")\n",
"lp_val_follow_indexes = np.arange(0, num_vals)\n",
"lp_val_follow_neg_indexes = np.repeat(lp_val_follow_indexes, 10)\n",
"lp_val_follow_indexes = np.concatenate([lp_val_follow_indexes, lp_val_follow_neg_indexes])\n",
"print(f\"Part of val indexes[user:follow:item] for link prediction: {lp_val_follow_indexes[:3]}\")\n",
"np.save(lp_val_follow_indexes_path, lp_val_follow_indexes)\n",
"print(f\"LP val indexes[user:follow:item] are saved to {lp_val_follow_indexes_path}\\n\")\n",
"\n",
"# Test seeds for user:like:item.\n",
"lp_test_like_seeds_path = os.path.join(base_dir, \"lp-test-like-seeds.npy\")\n",
"lp_test_like_seeds = like_edges[-num_tests:, :]\n",
"lp_test_like_neg_dsts = np.random.randint(0, num_nodes, (num_tests, 10)).reshape(-1)\n",
"lp_test_like_neg_srcs = np.repeat(lp_test_like_seeds[:,0], 10)\n",
"lp_test_like_neg_seeds = np.concatenate((lp_test_like_neg_srcs, lp_test_like_neg_dsts)).reshape(2,-1).T\n",
"lp_test_like_seeds = np.concatenate((lp_test_like_seeds, lp_test_like_neg_seeds))\n",
"print(f\"Part of test seeds[user:like:item] for link prediction: {lp_test_like_seeds[:3]}\")\n",
"np.save(lp_test_like_seeds_path, lp_test_like_seeds)\n",
"print(f\"LP test seeds[user:like:item] are saved to {lp_test_like_seeds_path}\\n\")\n",
"\n",
"# Test labels for user:like:item.\n",
"lp_test_like_labels_path = os.path.join(base_dir, \"lp-test-like-labels.npy\")\n",
"lp_test_like_labels = np.empty(num_tests * (10 + 1))\n",
"lp_test_like_labels[:num_tests] = 1\n",
"lp_test_like_labels[num_tests:] = 0\n",
"print(f\"Part of test labels[user:like:item] for link prediction: {lp_test_like_labels[:3]}\")\n",
"np.save(lp_test_like_labels_path, lp_test_like_labels)\n",
"print(f\"LP test labels[user:like:item] are saved to {lp_test_like_labels_path}\\n\")\n",
"\n",
"# Test indexes for user:like:item.\n",
"lp_test_like_indexes_path = os.path.join(base_dir, \"lp-test-like-indexes.npy\")\n",
"lp_test_like_indexes = np.arange(0, num_tests)\n",
"lp_test_like_neg_indexes = np.repeat(lp_test_like_indexes, 10)\n",
"lp_test_like_indexes = np.concatenate([lp_test_like_indexes, lp_test_like_neg_indexes])\n",
"print(f\"Part of test indexes[user:like:item] for link prediction: {lp_test_like_indexes[:3]}\")\n",
"np.save(lp_test_like_indexes_path, lp_test_like_indexes)\n",
"print(f\"LP test indexes[user:like:item] are saved to {lp_test_like_indexes_path}\\n\")\n",
"\n",
"# Test seeds for user:follow:item.\n",
"lp_test_follow_seeds_path = os.path.join(base_dir, \"lp-test-follow-seeds.npy\")\n",
"lp_test_follow_seeds = follow_edges[-num_tests:, :]\n",
"lp_test_follow_neg_dsts = np.random.randint(0, num_nodes, (num_tests, 10)).reshape(-1)\n",
"lp_test_follow_neg_srcs = np.repeat(lp_test_follow_seeds[:,0], 10)\n",
"lp_test_follow_neg_seeds = np.concatenate((lp_test_follow_neg_srcs, lp_test_follow_neg_dsts)).reshape(2,-1).T\n",
"lp_test_follow_seeds = np.concatenate((lp_test_follow_seeds, lp_test_follow_neg_seeds))\n",
"print(f\"Part of test seeds[user:follow:item] for link prediction: {lp_test_follow_seeds[:3]}\")\n",
"np.save(lp_test_follow_seeds_path, lp_test_follow_seeds)\n",
"print(f\"LP test seeds[user:follow:item] are saved to {lp_test_follow_seeds_path}\\n\")\n",
"\n",
"# Test labels for user:follow:item.\n",
"lp_test_follow_labels_path = os.path.join(base_dir, \"lp-test-follow-labels.npy\")\n",
"lp_test_follow_labels = np.empty(num_tests * (10 + 1))\n",
"lp_test_follow_labels[:num_tests] = 1\n",
"lp_test_follow_labels[num_tests:] = 0\n",
"print(f\"Part of test labels[user:follow:item] for link prediction: {lp_test_follow_labels[:3]}\")\n",
"np.save(lp_test_follow_labels_path, lp_test_follow_labels)\n",
"print(f\"LP test labels[user:follow:item] are saved to {lp_test_follow_labels_path}\\n\")\n",
"\n",
"# Test indexes for user:follow:item.\n",
"lp_test_follow_indexes_path = os.path.join(base_dir, \"lp-test-follow-indexes.npy\")\n",
"lp_test_follow_indexes = np.arange(0, num_tests)\n",
"lp_test_follow_neg_indexes = np.repeat(lp_test_follow_indexes, 10)\n",
"lp_test_follow_indexes = np.concatenate([lp_test_follow_indexes, lp_test_follow_neg_indexes])\n",
"print(f\"Part of test indexes[user:follow:item] for link prediction: {lp_test_follow_indexes[:3]}\")\n",
"np.save(lp_test_follow_indexes_path, lp_test_follow_indexes)\n",
"print(f\"LP test indexes[user:follow:item] are saved to {lp_test_follow_indexes_path}\\n\")"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "wbk6-wxRK-6S"
},
"source": [
"## Organize Data into YAML File\n",
"Now we need to create a `metadata.yaml` file which contains the paths, dadta types of graph structure, feature data, training/validation/test sets. Please note that all path should be relative to `metadata.yaml`.\n",
......@@ -457,13 +505,15 @@
" - `in_memory`: indicates whether to load dada into memory or `mmap`. Default is `True`.\n",
"\n",
"Please refer to [YAML specification](https://github.com/dmlc/dgl/blob/master/docs/source/stochastic_training/ondisk-dataset-specification.rst) for more details."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"id": "ddGTWW61Lpwp"
},
"outputs": [],
"source": [
"yaml_content = f\"\"\"\n",
" dataset_name: heterogeneous_graph_nc_lp\n",
......@@ -527,7 +577,7 @@
" train_set:\n",
" - type: user\n",
" data:\n",
" - name: seed_nodes\n",
" - name: seeds\n",
" format: numpy\n",
" path: {os.path.basename(nc_train_user_ids_path)}\n",
" - name: labels\n",
......@@ -535,7 +585,7 @@
" path: {os.path.basename(nc_train_user_labels_path)}\n",
" - type: item\n",
" data:\n",
" - name: seed_nodes\n",
" - name: seeds\n",
" format: numpy\n",
" path: {os.path.basename(nc_train_item_ids_path)}\n",
" - name: labels\n",
......@@ -544,7 +594,7 @@
" validation_set:\n",
" - type: user\n",
" data:\n",
" - name: seed_nodes\n",
" - name: seeds\n",
" format: numpy\n",
" path: {os.path.basename(nc_val_user_ids_path)}\n",
" - name: labels\n",
......@@ -552,7 +602,7 @@
" path: {os.path.basename(nc_val_user_labels_path)}\n",
" - type: item\n",
" data:\n",
" - name: seed_nodes\n",
" - name: seeds\n",
" format: numpy\n",
" path: {os.path.basename(nc_val_item_ids_path)}\n",
" - name: labels\n",
......@@ -561,7 +611,7 @@
" test_set:\n",
" - type: user\n",
" data:\n",
" - name: seed_nodes\n",
" - name: seeds\n",
" format: numpy\n",
" path: {os.path.basename(nc_test_user_ids_path)}\n",
" - name: labels\n",
......@@ -569,7 +619,7 @@
" path: {os.path.basename(nc_test_user_labels_path)}\n",
" - type: item\n",
" data:\n",
" - name: seed_nodes\n",
" - name: seeds\n",
" format: numpy\n",
" path: {os.path.basename(nc_test_item_ids_path)}\n",
" - name: labels\n",
......@@ -580,61 +630,71 @@
" train_set:\n",
" - type: \"user:like:item\"\n",
" data:\n",
" - name: node_pairs\n",
" - name: seeds\n",
" format: numpy\n",
" path: {os.path.basename(lp_train_like_node_pairs_path)}\n",
" path: {os.path.basename(lp_train_like_seeds_path)}\n",
" - type: \"user:follow:user\"\n",
" data:\n",
" - name: node_pairs\n",
" - name: seeds\n",
" format: numpy\n",
" path: {os.path.basename(lp_train_follow_node_pairs_path)}\n",
" path: {os.path.basename(lp_train_follow_seeds_path)}\n",
" validation_set:\n",
" - type: \"user:like:item\"\n",
" data:\n",
" - name: node_pairs\n",
" - name: seeds\n",
" format: numpy\n",
" path: {os.path.basename(lp_val_like_node_pairs_path)}\n",
" - name: negative_dsts\n",
" format: torch\n",
" path: {os.path.basename(lp_val_like_neg_dsts_path)}\n",
" path: {os.path.basename(lp_val_like_seeds_path)}\n",
" - name: labels\n",
" format: numpy\n",
" path: {os.path.basename(lp_val_like_labels_path)}\n",
" - name: indexes\n",
" format: numpy\n",
" path: {os.path.basename(lp_val_like_indexes_path)}\n",
" - type: \"user:follow:user\"\n",
" data:\n",
" - name: node_pairs\n",
" - name: seeds\n",
" format: numpy\n",
" path: {os.path.basename(lp_val_follow_node_pairs_path)}\n",
" - name: negative_dsts\n",
" format: torch\n",
" path: {os.path.basename(lp_val_follow_neg_dsts_path)}\n",
" path: {os.path.basename(lp_val_follow_seeds_path)}\n",
" - name: labels\n",
" format: numpy\n",
" path: {os.path.basename(lp_val_follow_labels_path)}\n",
" - name: indexes\n",
" format: numpy\n",
" path: {os.path.basename(lp_val_follow_indexes_path)}\n",
" test_set:\n",
" - type: \"user:like:item\"\n",
" data:\n",
" - name: node_pairs\n",
" - name: seeds\n",
" format: numpy\n",
" path: {os.path.basename(lp_test_like_node_pairs_path)}\n",
" - name: negative_dsts\n",
" format: torch\n",
" path: {os.path.basename(lp_test_like_neg_dsts_path)}\n",
" path: {os.path.basename(lp_test_like_seeds_path)}\n",
" - name: labels\n",
" format: numpy\n",
" path: {os.path.basename(lp_test_like_labels_path)}\n",
" - name: indexes\n",
" format: numpy\n",
" path: {os.path.basename(lp_test_like_indexes_path)}\n",
" - type: \"user:follow:user\"\n",
" data:\n",
" - name: node_pairs\n",
" - name: seeds\n",
" format: numpy\n",
" path: {os.path.basename(lp_test_follow_node_pairs_path)}\n",
" - name: negative_dsts\n",
" format: torch\n",
" path: {os.path.basename(lp_test_follow_neg_dsts_path)}\n",
" path: {os.path.basename(lp_test_follow_seeds_path)}\n",
" - name: labels\n",
" format: numpy\n",
" path: {os.path.basename(lp_test_follow_labels_path)}\n",
" - name: indexes\n",
" format: numpy\n",
" path: {os.path.basename(lp_test_follow_indexes_path)}\n",
"\"\"\"\n",
"metadata_path = os.path.join(base_dir, \"metadata.yaml\")\n",
"with open(metadata_path, \"w\") as f:\n",
" f.write(yaml_content)"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "kEfybHGhOW7O"
},
"source": [
"## Instantiate `OnDiskDataset`\n",
"Now we're ready to load dataset via `dgl.graphbolt.OnDiskDataset`. When instantiating, we just pass in the base directory where `metadata.yaml` file lies.\n",
......@@ -642,13 +702,15 @@
"During first instantiation, GraphBolt preprocesses the raw data such as constructing `FusedCSCSamplingGraph` from edges. All data including graph, feature data, training/validation/test sets are put into `preprocessed` directory after preprocessing. Any following dataset loading will skip the preprocess stage.\n",
"\n",
"After preprocessing, `load()` is required to be called explicitly in order to load graph, feature data and tasks."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"id": "W58CZoSzOiyo"
},
"outputs": [],
"source": [
"dataset = gb.OnDiskDataset(base_dir).load()\n",
"graph = dataset.graph\n",
......@@ -662,12 +724,31 @@
"print(f\"Loaded node classification task: {nc_task}\\n\")\n",
"lp_task = tasks[1]\n",
"print(f\"Loaded link prediction task: {lp_task}\\n\")"
]
}
],
"metadata": {
"id": "W58CZoSzOiyo"
"colab": {
"private_outputs": true,
"provenance": []
},
"execution_count": null,
"outputs": []
"kernelspec": {
"display_name": "Python 3",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.10.12"
}
},
"nbformat": 4,
"nbformat_minor": 0
}
{
"nbformat": 4,
"nbformat_minor": 0,
"metadata": {
"colab": {
"private_outputs": true,
"provenance": []
},
"kernelspec": {
"name": "python3",
"display_name": "Python 3"
},
"language_info": {
"name": "python"
}
},
"cells": [
{
"cell_type": "markdown",
"metadata": {
"id": "FnFhPMaAfLtJ"
},
"source": [
"# OnDiskDataset for Homogeneous Graph\n",
"\n",
......@@ -33,22 +21,24 @@
"To create an ``OnDiskDataset`` object, you need to organize all the data including graph structure, feature data and tasks into a directory. The directory should contain a ``metadata.yaml`` file that describes the metadata of the dataset.\n",
"\n",
"Now let's generate various data step by step and organize them together to instantiate `OnDiskDataset` finally."
]
},
{
"cell_type": "markdown",
"source": [
"## Install DGL package"
],
"metadata": {
"id": "Wlb19DtWgtzq"
}
},
"source": [
"## Install DGL package"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"id": "UojlT9ZGgyr9"
},
"outputs": [],
"source": [
"# Install required packages.\n",
"import os\n",
......@@ -69,52 +59,52 @@
" installed = False\n",
" print(error)\n",
"print(\"DGL installed!\" if installed else \"DGL not found!\")"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "2R7WnSbjsfbr"
},
"source": [
"## Data preparation\n",
"In order to demonstrate how to organize various data, let's create a base directory first."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"id": "SZipbzyltLfO"
},
"outputs": [],
"source": [
"base_dir = './ondisk_dataset_homograph'\n",
"os.makedirs(base_dir, exist_ok=True)\n",
"print(f\"Created base directory: {base_dir}\")"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "qhNtIn_xhlnl"
},
"source": [
"### Generate graph structure data\n",
"For homogeneous graph, we just need to save edges(namely node pairs) into **Numpy** or **CSV** file.\n",
"For homogeneous graph, we just need to save edges(namely seeds) into **Numpy** or **CSV** file.\n",
"\n",
"Note:\n",
"- when saving to **Numpy**, the array requires to be in shape of `(2, N)`. This format is recommended as constructing graph from it is much faster than **CSV** file.\n",
"- when saving to **CSV** file, do not save index and header.\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"id": "HcBt4G5BmSjr"
},
"outputs": [],
"source": [
"import numpy as np\n",
"import pandas as pd\n",
......@@ -129,25 +119,25 @@
"df.to_csv(edges_path, index=False, header=False)\n",
"\n",
"print(f\"Edges are saved into {edges_path}\")"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "kh-4cPtzpcaH"
},
"source": [
"### Generate feature data for graph\n",
"For feature data, numpy arrays and torch tensors are supported for now."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"id": "_PVu1u5brBhF"
},
"outputs": [],
"source": [
"# Generate node feature in numpy array.\n",
"node_feat_0_path = os.path.join(base_dir, \"node-feat-0.npy\")\n",
......@@ -176,35 +166,35 @@
"print(f\"Part of edge feature [feat_1]: {edge_feat_1[:3, :]}\")\n",
"torch.save(edge_feat_1, edge_feat_1_path)\n",
"print(f\"Edge feature [feat_1] is saved to {edge_feat_1_path}\\n\")\n"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "ZyqgOtsIwzh_"
},
"source": [
"### Generate tasks\n",
"`OnDiskDataset` supports multiple tasks. For each task, we need to prepare training/validation/test sets respectively. Such sets usually vary among different tasks. In this tutorial, let's create a **Node Classification** task and **Link Prediction** task."
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "hVxHaDIfzCkr"
},
"source": [
"#### Node Classification Task\n",
"For node classification task, we need **node IDs** and corresponding **labels** for each training/validation/test set. Like feature data, numpy arrays and torch tensors are supported for these sets."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"id": "S5-fyBbHzTCO"
},
"outputs": [],
"source": [
"num_trains = int(num_nodes * 0.6)\n",
"num_vals = int(num_nodes * 0.2)\n",
......@@ -248,68 +238,94 @@
"print(f\"Part of test labels for node classification: {nc_test_labels[:3]}\")\n",
"torch.save(nc_test_labels, nc_test_labels_path)\n",
"print(f\"NC test labels are saved to {nc_test_labels_path}\\n\")"
]
},
{
"cell_type": "markdown",
"source": [
"#### Link Prediction Task\n",
"For link prediction task, we need **node pairs** or **negative src/dsts** for each training/validation/test set. Like feature data, numpy arrays and torch tensors are supported for these sets."
],
"metadata": {
"id": "LhAcDCHQ_KJ0"
}
},
"source": [
"#### Link Prediction Task\n",
"For link prediction task, we need **seeds** or **corresponding labels and indexes** which representing the pos/neg property and group of the seeds for each training/validation/test set. Like feature data, numpy arrays and torch tensors are supported for these sets."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"id": "u0jCnXIcAQy4"
},
"outputs": [],
"source": [
"num_trains = int(num_edges * 0.6)\n",
"num_vals = int(num_edges * 0.2)\n",
"num_tests = num_edges - num_trains - num_vals\n",
"\n",
"lp_train_node_pairs_path = os.path.join(base_dir, \"lp-train-node-pairs.npy\")\n",
"lp_train_node_pairs = edges[:num_trains, :]\n",
"print(f\"Part of train node pairs for link prediction: {lp_train_node_pairs[:3]}\")\n",
"np.save(lp_train_node_pairs_path, lp_train_node_pairs)\n",
"print(f\"LP train node pairs are saved to {lp_train_node_pairs_path}\\n\")\n",
"\n",
"lp_val_node_pairs_path = os.path.join(base_dir, \"lp-val-node-pairs.npy\")\n",
"lp_val_node_pairs = edges[num_trains:num_trains+num_vals, :]\n",
"print(f\"Part of val node pairs for link prediction: {lp_val_node_pairs[:3]}\")\n",
"np.save(lp_val_node_pairs_path, lp_val_node_pairs)\n",
"print(f\"LP val node pairs are saved to {lp_val_node_pairs_path}\\n\")\n",
"\n",
"lp_val_neg_dsts_path = os.path.join(base_dir, \"lp-val-neg-dsts.pt\")\n",
"lp_val_neg_dsts = torch.randint(0, num_nodes, (num_vals, 10))\n",
"print(f\"Part of val negative dsts for link prediction: {lp_val_neg_dsts[:3]}\")\n",
"torch.save(lp_val_neg_dsts, lp_val_neg_dsts_path)\n",
"print(f\"LP val negative dsts are saved to {lp_val_neg_dsts_path}\\n\")\n",
"\n",
"lp_test_node_pairs_path = os.path.join(base_dir, \"lp-test-node-pairs.npy\")\n",
"lp_test_node_pairs = edges[-num_tests:, :]\n",
"print(f\"Part of test node pairs for link prediction: {lp_test_node_pairs[:3]}\")\n",
"np.save(lp_test_node_pairs_path, lp_test_node_pairs)\n",
"print(f\"LP test node pairs are saved to {lp_test_node_pairs_path}\\n\")\n",
"\n",
"lp_test_neg_dsts_path = os.path.join(base_dir, \"lp-test-neg-dsts.pt\")\n",
"lp_test_neg_dsts = torch.randint(0, num_nodes, (num_tests, 10))\n",
"print(f\"Part of test negative dsts for link prediction: {lp_test_neg_dsts[:3]}\")\n",
"torch.save(lp_test_neg_dsts, lp_test_neg_dsts_path)\n",
"print(f\"LP test negative dsts are saved to {lp_test_neg_dsts_path}\\n\")"
],
"metadata": {
"id": "u0jCnXIcAQy4"
},
"execution_count": null,
"outputs": []
"lp_train_seeds_path = os.path.join(base_dir, \"lp-train-seeds.npy\")\n",
"lp_train_seeds = edges[:num_trains, :]\n",
"print(f\"Part of train seeds for link prediction: {lp_train_seeds[:3]}\")\n",
"np.save(lp_train_seeds_path, lp_train_seeds)\n",
"print(f\"LP train seeds are saved to {lp_train_seeds_path}\\n\")\n",
"\n",
"lp_val_seeds_path = os.path.join(base_dir, \"lp-val-seeds.npy\")\n",
"lp_val_seeds = edges[num_trains:num_trains+num_vals, :]\n",
"lp_val_neg_dsts = np.random.randint(0, num_nodes, (num_vals, 10)).reshape(-1)\n",
"lp_val_neg_srcs = np.repeat(lp_val_seeds[:,0], 10)\n",
"lp_val_neg_seeds = np.concatenate((lp_val_neg_srcs, lp_val_neg_dsts)).reshape(2,-1).T\n",
"lp_val_seeds = np.concatenate((lp_val_seeds, lp_val_neg_seeds))\n",
"print(f\"Part of val seeds for link prediction: {lp_val_seeds[:3]}\")\n",
"np.save(lp_val_seeds_path, lp_val_seeds)\n",
"print(f\"LP val seeds are saved to {lp_val_seeds_path}\\n\")\n",
"\n",
"lp_val_labels_path = os.path.join(base_dir, \"lp-val-labels.npy\")\n",
"lp_val_labels = np.empty(num_vals * (10 + 1))\n",
"lp_val_labels[:num_vals] = 1\n",
"lp_val_labels[num_vals:] = 0\n",
"print(f\"Part of val labels for link prediction: {lp_val_labels[:3]}\")\n",
"np.save(lp_val_labels_path, lp_val_labels)\n",
"print(f\"LP val labels are saved to {lp_val_labels_path}\\n\")\n",
"\n",
"lp_val_indexes_path = os.path.join(base_dir, \"lp-val-indexes.npy\")\n",
"lp_val_indexes = np.arange(0, num_vals)\n",
"lp_val_neg_indexes = np.repeat(lp_val_indexes, 10)\n",
"lp_val_indexes = np.concatenate([lp_val_indexes, lp_val_neg_indexes])\n",
"print(f\"Part of val indexes for link prediction: {lp_val_indexes[:3]}\")\n",
"np.save(lp_val_indexes_path, lp_val_indexes)\n",
"print(f\"LP val indexes are saved to {lp_val_indexes_path}\\n\")\n",
"\n",
"lp_test_seeds_path = os.path.join(base_dir, \"lp-test-seeds.npy\")\n",
"lp_test_seeds = edges[-num_tests:, :]\n",
"lp_test_neg_dsts = np.random.randint(0, num_nodes, (num_tests, 10)).reshape(-1)\n",
"lp_test_neg_srcs = np.repeat(lp_test_seeds[:,0], 10)\n",
"lp_test_neg_seeds = np.concatenate((lp_test_neg_srcs, lp_test_neg_dsts)).reshape(2,-1).T\n",
"lp_test_seeds = np.concatenate((lp_test_seeds, lp_test_neg_seeds))\n",
"print(f\"Part of test seeds for link prediction: {lp_test_seeds[:3]}\")\n",
"np.save(lp_test_seeds_path, lp_test_seeds)\n",
"print(f\"LP test seeds are saved to {lp_test_seeds_path}\\n\")\n",
"\n",
"lp_test_labels_path = os.path.join(base_dir, \"lp-test-labels.npy\")\n",
"lp_test_labels = np.empty(num_tests * (10 + 1))\n",
"lp_test_labels[:num_tests] = 1\n",
"lp_test_labels[num_tests:] = 0\n",
"print(f\"Part of val labels for link prediction: {lp_test_labels[:3]}\")\n",
"np.save(lp_test_labels_path, lp_test_labels)\n",
"print(f\"LP test labels are saved to {lp_test_labels_path}\\n\")\n",
"\n",
"lp_test_indexes_path = os.path.join(base_dir, \"lp-test-indexes.npy\")\n",
"lp_test_indexes = np.arange(0, num_tests)\n",
"lp_test_neg_indexes = np.repeat(lp_test_indexes, 10)\n",
"lp_test_indexes = np.concatenate([lp_test_indexes, lp_test_neg_indexes])\n",
"print(f\"Part of test indexes for link prediction: {lp_test_indexes[:3]}\")\n",
"np.save(lp_test_indexes_path, lp_test_indexes)\n",
"print(f\"LP test indexes are saved to {lp_test_indexes_path}\\n\")"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "wbk6-wxRK-6S"
},
"source": [
"## Organize Data into YAML File\n",
"Now we need to create a `metadata.yaml` file which contains the paths, dadta types of graph structure, feature data, training/validation/test sets.\n",
......@@ -320,13 +336,15 @@
" - `in_memory`: indicates whether to load dada into memory or `mmap`. Default is `True`.\n",
"\n",
"Please refer to [YAML specification](https://github.com/dmlc/dgl/blob/master/docs/source/stochastic_training/ondisk-dataset-specification.rst) for more details."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"id": "ddGTWW61Lpwp"
},
"outputs": [],
"source": [
"yaml_content = f\"\"\"\n",
" dataset_name: homogeneous_graph_nc_lp\n",
......@@ -358,7 +376,7 @@
" num_classes: 10\n",
" train_set:\n",
" - data:\n",
" - name: seed_nodes\n",
" - name: seeds\n",
" format: numpy\n",
" path: {os.path.basename(nc_train_ids_path)}\n",
" - name: labels\n",
......@@ -366,7 +384,7 @@
" path: {os.path.basename(nc_train_labels_path)}\n",
" validation_set:\n",
" - data:\n",
" - name: seed_nodes\n",
" - name: seeds\n",
" format: numpy\n",
" path: {os.path.basename(nc_val_ids_path)}\n",
" - name: labels\n",
......@@ -374,7 +392,7 @@
" path: {os.path.basename(nc_val_labels_path)}\n",
" test_set:\n",
" - data:\n",
" - name: seed_nodes\n",
" - name: seeds\n",
" format: numpy\n",
" path: {os.path.basename(nc_test_ids_path)}\n",
" - name: labels\n",
......@@ -384,38 +402,42 @@
" num_classes: 10\n",
" train_set:\n",
" - data:\n",
" - name: node_pairs\n",
" - name: seeds\n",
" format: numpy\n",
" path: {os.path.basename(lp_train_node_pairs_path)}\n",
" path: {os.path.basename(lp_train_seeds_path)}\n",
" validation_set:\n",
" - data:\n",
" - name: node_pairs\n",
" - name: seeds\n",
" format: numpy\n",
" path: {os.path.basename(lp_val_node_pairs_path)}\n",
" - name: negative_dsts\n",
" format: torch\n",
" path: {os.path.basename(lp_val_neg_dsts_path)}\n",
" path: {os.path.basename(lp_val_seeds_path)}\n",
" - name: labels\n",
" format: numpy\n",
" path: {os.path.basename(lp_val_labels_path)}\n",
" - name: indexes\n",
" format: numpy\n",
" path: {os.path.basename(lp_val_indexes_path)}\n",
" test_set:\n",
" - data:\n",
" - name: node_pairs\n",
" - name: seeds\n",
" format: numpy\n",
" path: {os.path.basename(lp_test_node_pairs_path)}\n",
" - name: negative_dsts\n",
" format: torch\n",
" path: {os.path.basename(lp_test_neg_dsts_path)}\n",
" path: {os.path.basename(lp_test_seeds_path)}\n",
" - name: labels\n",
" format: numpy\n",
" path: {os.path.basename(lp_test_labels_path)}\n",
" - name: indexes\n",
" format: numpy\n",
" path: {os.path.basename(lp_test_indexes_path)}\n",
"\"\"\"\n",
"metadata_path = os.path.join(base_dir, \"metadata.yaml\")\n",
"with open(metadata_path, \"w\") as f:\n",
" f.write(yaml_content)"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "kEfybHGhOW7O"
},
"source": [
"## Instantiate `OnDiskDataset`\n",
"Now we're ready to load dataset via `dgl.graphbolt.OnDiskDataset`. When instantiating, we just pass in the base directory where `metadata.yaml` file lies.\n",
......@@ -423,13 +445,15 @@
"During first instantiation, GraphBolt preprocesses the raw data such as constructing `FusedCSCSamplingGraph` from edges. All data including graph, feature data, training/validation/test sets are put into `preprocessed` directory after preprocessing. Any following dataset loading will skip the preprocess stage.\n",
"\n",
"After preprocessing, `load()` is required to be called explicitly in order to load graph, feature data and tasks."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"id": "W58CZoSzOiyo"
},
"outputs": [],
"source": [
"dataset = gb.OnDiskDataset(base_dir).load()\n",
"graph = dataset.graph\n",
......@@ -443,12 +467,31 @@
"print(f\"Loaded node classification task: {nc_task}\\n\")\n",
"lp_task = tasks[1]\n",
"print(f\"Loaded link prediction task: {lp_task}\\n\")"
]
}
],
"metadata": {
"id": "W58CZoSzOiyo"
"colab": {
"private_outputs": true,
"provenance": []
},
"execution_count": null,
"outputs": []
"kernelspec": {
"display_name": "Python 3",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.10.12"
}
},
"nbformat": 4,
"nbformat_minor": 0
}