Unverified Commit e847fc44 authored by yxy235, committed by GitHub

[GraphBolt] Update notebooks about how to construct ondisk datasets. (#7268)


Co-authored-by: Ubuntu <ubuntu@ip-172-31-0-133.us-west-2.compute.internal>
parent 78df8101
{
"nbformat": 4,
"nbformat_minor": 0,
"metadata": {
"colab": {
"private_outputs": true,
"provenance": []
},
"kernelspec": {
"name": "python3",
"display_name": "Python 3"
},
"language_info": {
"name": "python"
}
},
"cells": [
{
"cell_type": "markdown",
"metadata": {
"id": "FnFhPMaAfLtJ"
},
"source": [
"# OnDiskDataset for Heterogeneous Graph\n",
"\n",
......@@ -33,22 +21,24 @@
"To create an ``OnDiskDataset`` object, you need to organize all the data including graph structure, feature data and tasks into a directory. The directory should contain a ``metadata.yaml`` file that describes the metadata of the dataset.\n",
"\n",
"Now let's generate various data step by step and organize them together to instantiate `OnDiskDataset` finally."
]
},
{
"cell_type": "markdown",
"source": [
"## Install DGL package"
],
"metadata": {
"id": "Wlb19DtWgtzq"
}
},
"source": [
"## Install DGL package"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"id": "UojlT9ZGgyr9"
},
"outputs": [],
"source": [
"# Install required packages.\n",
"import os\n",
......@@ -69,52 +59,52 @@
" installed = False\n",
" print(error)\n",
"print(\"DGL installed!\" if installed else \"DGL not found!\")"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "2R7WnSbjsfbr"
},
"source": [
"## Data preparation\n",
"In order to demonstrate how to organize various data, let's create a base directory first."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"id": "SZipbzyltLfO"
},
"outputs": [],
"source": [
"base_dir = './ondisk_dataset_heterograph'\n",
"os.makedirs(base_dir, exist_ok=True)\n",
"print(f\"Created base directory: {base_dir}\")"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "qhNtIn_xhlnl"
},
"source": [
"### Generate graph structure data\n",
"For heterogeneous graph, we need to save different edge edges(namely node pairs) into separate **Numpy** or **CSV** files.\n",
"For heterogeneous graph, we need to save different edge edges(namely seeds) into separate **Numpy** or **CSV** files.\n",
"\n",
"Note:\n",
"- when saving to **Numpy**, the array requires to be in shape of `(2, N)`. This format is recommended as constructing graph from it is much faster than **CSV** file.\n",
"- when saving to **CSV** file, do not save index and header.\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"id": "HcBt4G5BmSjr"
},
"outputs": [],
"source": [
"import numpy as np\n",
"import pandas as pd\n",
......@@ -143,25 +133,25 @@
"df = pd.DataFrame(follow_edges)\n",
"df.to_csv(follow_edges_path, index=False, header=False)\n",
"print(f\"[user:follow:user] edges are saved into {follow_edges_path}\\n\")"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "kh-4cPtzpcaH"
},
"source": [
"### Generate feature data for graph\n",
"For feature data, numpy arrays and torch tensors are supported for now. Let's generate feature data for each node/edge type."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"id": "_PVu1u5brBhF"
},
"outputs": [],
"source": [
"# Generate node[user] feature in numpy array.\n",
"node_user_feat_0_path = os.path.join(base_dir, \"node-user-feat-0.npy\")\n",
......@@ -218,35 +208,35 @@
"print(f\"Part of edge[user:follow:user] feature [feat_1]: {edge_follow_feat_1[:3, :]}\")\n",
"torch.save(edge_follow_feat_1, edge_follow_feat_1_path)\n",
"print(f\"Edge[user:follow:user] feature [feat_1] is saved to {edge_follow_feat_1_path}\\n\")"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "ZyqgOtsIwzh_"
},
"source": [
"### Generate tasks\n",
"`OnDiskDataset` supports multiple tasks. For each task, we need to prepare training/validation/test sets respectively. Such sets usually vary among different tasks. In this tutorial, let's create a **Node Classification** task and **Link Prediction** task."
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "hVxHaDIfzCkr"
},
"source": [
"#### Node Classification Task\n",
"For node classification task, we need **node IDs** and corresponding **labels** for each training/validation/test set. Like feature data, numpy arrays and torch tensors are supported for these sets."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"id": "S5-fyBbHzTCO"
},
"outputs": [],
"source": [
"# For illustration, let's generate item sets for each node type.\n",
"num_trains = int(num_nodes * 0.6)\n",
......@@ -342,109 +332,167 @@
"print(f\"Part of test labels[item] for node classification: {nc_test_item_labels[:3]}\")\n",
"torch.save(nc_test_item_labels, nc_test_item_labels_path)\n",
"print(f\"NC test labels[item] are saved to {nc_test_item_labels_path}\\n\")"
]
},
{
"cell_type": "markdown",
"source": [
"#### Link Prediction Task\n",
"For link prediction task, we need **node pairs** or **negative src/dsts** for each training/validation/test set. Like feature data, numpy arrays and torch tensors are supported for these sets."
],
"metadata": {
"id": "LhAcDCHQ_KJ0"
}
},
"source": [
"#### Link Prediction Task\n",
"For link prediction task, we need **seeds** or **corresponding labels and indexes** which representing the pos/neg property and group of the seeds for each training/validation/test set. Like feature data, numpy arrays and torch tensors are supported for these sets."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"id": "u0jCnXIcAQy4"
},
"outputs": [],
"source": [
"# For illustration, let's generate item sets for each edge type.\n",
"num_trains = int(num_edges * 0.6)\n",
"num_vals = int(num_edges * 0.2)\n",
"num_tests = num_edges - num_trains - num_vals\n",
"\n",
"# Train node pairs for user:like:item.\n",
"lp_train_like_node_pairs_path = os.path.join(base_dir, \"lp-train-like-node-pairs.npy\")\n",
"lp_train_like_node_pairs = like_edges[:num_trains, :]\n",
"print(f\"Part of train node pairs[user:like:item] for link prediction: {lp_train_like_node_pairs[:3]}\")\n",
"np.save(lp_train_like_node_pairs_path, lp_train_like_node_pairs)\n",
"print(f\"LP train node pairs[user:like:item] are saved to {lp_train_like_node_pairs_path}\\n\")\n",
"\n",
"# Train node pairs for user:follow:user.\n",
"lp_train_follow_node_pairs_path = os.path.join(base_dir, \"lp-train-follow-node-pairs.npy\")\n",
"lp_train_follow_node_pairs = follow_edges[:num_trains, :]\n",
"print(f\"Part of train node pairs[user:follow:user] for link prediction: {lp_train_follow_node_pairs[:3]}\")\n",
"np.save(lp_train_follow_node_pairs_path, lp_train_follow_node_pairs)\n",
"print(f\"LP train node pairs[user:follow:user] are saved to {lp_train_follow_node_pairs_path}\\n\")\n",
"\n",
"# Val node pairs for user:like:item.\n",
"lp_val_like_node_pairs_path = os.path.join(base_dir, \"lp-val-like-node-pairs.npy\")\n",
"lp_val_like_node_pairs = like_edges[num_trains:num_trains+num_vals, :]\n",
"print(f\"Part of val node pairs[user:like:item] for link prediction: {lp_val_like_node_pairs[:3]}\")\n",
"np.save(lp_val_like_node_pairs_path, lp_val_like_node_pairs)\n",
"print(f\"LP val node pairs[user:like:item] are saved to {lp_val_like_node_pairs_path}\\n\")\n",
"\n",
"# Val negative dsts for user:like:item.\n",
"lp_val_like_neg_dsts_path = os.path.join(base_dir, \"lp-val-like-neg-dsts.pt\")\n",
"lp_val_like_neg_dsts = torch.randint(0, num_nodes, (num_vals, 10))\n",
"print(f\"Part of val negative dsts[user:like:item] for link prediction: {lp_val_like_neg_dsts[:3]}\")\n",
"torch.save(lp_val_like_neg_dsts, lp_val_like_neg_dsts_path)\n",
"print(f\"LP val negative dsts[user:like:item] are saved to {lp_val_like_neg_dsts_path}\\n\")\n",
"\n",
"# Val node pairs for user:follow:user.\n",
"lp_val_follow_node_pairs_path = os.path.join(base_dir, \"lp-val-follow-node-pairs.npy\")\n",
"lp_val_follow_node_pairs = follow_edges[num_trains:num_trains+num_vals, :]\n",
"print(f\"Part of val node pairs[user:follow:user] for link prediction: {lp_val_follow_node_pairs[:3]}\")\n",
"np.save(lp_val_follow_node_pairs_path, lp_val_follow_node_pairs)\n",
"print(f\"LP val node pairs[user:follow:user] are saved to {lp_val_follow_node_pairs_path}\\n\")\n",
"\n",
"# Val negative dsts for user:follow:user.\n",
"lp_val_follow_neg_dsts_path = os.path.join(base_dir, \"lp-val-follow-neg-dsts.pt\")\n",
"lp_val_follow_neg_dsts = torch.randint(0, num_nodes, (num_vals, 10))\n",
"print(f\"Part of val negative dsts[user:follow:user] for link prediction: {lp_val_follow_neg_dsts[:3]}\")\n",
"torch.save(lp_val_follow_neg_dsts, lp_val_follow_neg_dsts_path)\n",
"print(f\"LP val negative dsts[user:follow:user] are saved to {lp_val_follow_neg_dsts_path}\\n\")\n",
"\n",
"# Test node paris for user:like:item.\n",
"lp_test_like_node_pairs_path = os.path.join(base_dir, \"lp-test-like-node-pairs.npy\")\n",
"lp_test_like_node_pairs = like_edges[-num_tests:, :]\n",
"print(f\"Part of test node pairs[user:like:item] for link prediction: {lp_test_like_node_pairs[:3]}\")\n",
"np.save(lp_test_like_node_pairs_path, lp_test_like_node_pairs)\n",
"print(f\"LP test node pairs[user:like:item] are saved to {lp_test_like_node_pairs_path}\\n\")\n",
"\n",
"# Test negative dsts for user:like:item.\n",
"lp_test_like_neg_dsts_path = os.path.join(base_dir, \"lp-test-like-neg-dsts.pt\")\n",
"lp_test_like_neg_dsts = torch.randint(0, num_nodes, (num_tests, 10))\n",
"print(f\"Part of test negative dsts[user:like:item] for link prediction: {lp_test_like_neg_dsts[:3]}\")\n",
"torch.save(lp_test_like_neg_dsts, lp_test_like_neg_dsts_path)\n",
"print(f\"LP test negative dsts[user:like:item] are saved to {lp_test_like_neg_dsts_path}\\n\")\n",
"\n",
"# Test node paris for user:follow:user.\n",
"lp_test_follow_node_pairs_path = os.path.join(base_dir, \"lp-test-follow-node-pairs.npy\")\n",
"lp_test_follow_node_pairs = follow_edges[-num_tests:, :]\n",
"print(f\"Part of test node pairs[user:follow:user] for link prediction: {lp_test_follow_node_pairs[:3]}\")\n",
"np.save(lp_test_follow_node_pairs_path, lp_test_follow_node_pairs)\n",
"print(f\"LP test node pairs[user:follow:user] are saved to {lp_test_follow_node_pairs_path}\\n\")\n",
"\n",
"# Test negative dsts for user:follow:user.\n",
"lp_test_follow_neg_dsts_path = os.path.join(base_dir, \"lp-test-follow-neg-dsts.pt\")\n",
"lp_test_follow_neg_dsts = torch.randint(0, num_nodes, (num_tests, 10))\n",
"print(f\"Part of test negative dsts[user:follow:user] for link prediction: {lp_test_follow_neg_dsts[:3]}\")\n",
"torch.save(lp_test_follow_neg_dsts, lp_test_follow_neg_dsts_path)\n",
"print(f\"LP test negative dsts[user:follow:user] are saved to {lp_test_follow_neg_dsts_path}\\n\")"
],
"metadata": {
"id": "u0jCnXIcAQy4"
},
"execution_count": null,
"outputs": []
"# Train seeds for user:like:item.\n",
"lp_train_like_seeds_path = os.path.join(base_dir, \"lp-train-like-seeds.npy\")\n",
"lp_train_like_seeds = like_edges[:num_trains, :]\n",
"print(f\"Part of train seeds[user:like:item] for link prediction: {lp_train_like_seeds[:3]}\")\n",
"np.save(lp_train_like_seeds_path, lp_train_like_seeds)\n",
"print(f\"LP train seeds[user:like:item] are saved to {lp_train_like_seeds_path}\\n\")\n",
"\n",
"# Train seeds for user:follow:user.\n",
"lp_train_follow_seeds_path = os.path.join(base_dir, \"lp-train-follow-seeds.npy\")\n",
"lp_train_follow_seeds = follow_edges[:num_trains, :]\n",
"print(f\"Part of train seeds[user:follow:user] for link prediction: {lp_train_follow_seeds[:3]}\")\n",
"np.save(lp_train_follow_seeds_path, lp_train_follow_seeds)\n",
"print(f\"LP train seeds[user:follow:user] are saved to {lp_train_follow_seeds_path}\\n\")\n",
"\n",
"# Val seeds for user:like:item.\n",
"lp_val_like_seeds_path = os.path.join(base_dir, \"lp-val-like-seeds.npy\")\n",
"lp_val_like_seeds = like_edges[num_trains:num_trains+num_vals, :]\n",
"lp_val_like_neg_dsts = np.random.randint(0, num_nodes, (num_vals, 10)).reshape(-1)\n",
"lp_val_like_neg_srcs = np.repeat(lp_val_like_seeds[:,0], 10)\n",
"lp_val_like_neg_seeds = np.concatenate((lp_val_like_neg_srcs, lp_val_like_neg_dsts)).reshape(2,-1).T\n",
"lp_val_like_seeds = np.concatenate((lp_val_like_seeds, lp_val_like_neg_seeds))\n",
"print(f\"Part of val seeds[user:like:item] for link prediction: {lp_val_like_seeds[:3]}\")\n",
"np.save(lp_val_like_seeds_path, lp_val_like_seeds)\n",
"print(f\"LP val seeds[user:like:item] are saved to {lp_val_like_seeds_path}\\n\")\n",
"\n",
"# Val labels for user:like:item.\n",
"lp_val_like_labels_path = os.path.join(base_dir, \"lp-val-like-labels.npy\")\n",
"lp_val_like_labels = np.empty(num_vals * (10 + 1))\n",
"lp_val_like_labels[:num_vals] = 1\n",
"lp_val_like_labels[num_vals:] = 0\n",
"print(f\"Part of val labels[user:like:item] for link prediction: {lp_val_like_labels[:3]}\")\n",
"np.save(lp_val_like_labels_path, lp_val_like_labels)\n",
"print(f\"LP val labels[user:like:item] are saved to {lp_val_like_labels_path}\\n\")\n",
"\n",
"# Val indexes for user:like:item.\n",
"lp_val_like_indexes_path = os.path.join(base_dir, \"lp-val-like-indexes.npy\")\n",
"lp_val_like_indexes = np.arange(0, num_vals)\n",
"lp_val_like_neg_indexes = np.repeat(lp_val_like_indexes, 10)\n",
"lp_val_like_indexes = np.concatenate([lp_val_like_indexes, lp_val_like_neg_indexes])\n",
"print(f\"Part of val indexes[user:like:item] for link prediction: {lp_val_like_indexes[:3]}\")\n",
"np.save(lp_val_like_indexes_path, lp_val_like_indexes)\n",
"print(f\"LP val indexes[user:like:item] are saved to {lp_val_like_indexes_path}\\n\")\n",
"\n",
"# Val seeds for user:follow:item.\n",
"lp_val_follow_seeds_path = os.path.join(base_dir, \"lp-val-follow-seeds.npy\")\n",
"lp_val_follow_seeds = follow_edges[num_trains:num_trains+num_vals, :]\n",
"lp_val_follow_neg_dsts = np.random.randint(0, num_nodes, (num_vals, 10)).reshape(-1)\n",
"lp_val_follow_neg_srcs = np.repeat(lp_val_follow_seeds[:,0], 10)\n",
"lp_val_follow_neg_seeds = np.concatenate((lp_val_follow_neg_srcs, lp_val_follow_neg_dsts)).reshape(2,-1).T\n",
"lp_val_follow_seeds = np.concatenate((lp_val_follow_seeds, lp_val_follow_neg_seeds))\n",
"print(f\"Part of val seeds[user:follow:item] for link prediction: {lp_val_follow_seeds[:3]}\")\n",
"np.save(lp_val_follow_seeds_path, lp_val_follow_seeds)\n",
"print(f\"LP val seeds[user:follow:item] are saved to {lp_val_follow_seeds_path}\\n\")\n",
"\n",
"# Val labels for user:follow:item.\n",
"lp_val_follow_labels_path = os.path.join(base_dir, \"lp-val-follow-labels.npy\")\n",
"lp_val_follow_labels = np.empty(num_vals * (10 + 1))\n",
"lp_val_follow_labels[:num_vals] = 1\n",
"lp_val_follow_labels[num_vals:] = 0\n",
"print(f\"Part of val labels[user:follow:item] for link prediction: {lp_val_follow_labels[:3]}\")\n",
"np.save(lp_val_follow_labels_path, lp_val_follow_labels)\n",
"print(f\"LP val labels[user:follow:item] are saved to {lp_val_follow_labels_path}\\n\")\n",
"\n",
"# Val indexes for user:follow:item.\n",
"lp_val_follow_indexes_path = os.path.join(base_dir, \"lp-val-follow-indexes.npy\")\n",
"lp_val_follow_indexes = np.arange(0, num_vals)\n",
"lp_val_follow_neg_indexes = np.repeat(lp_val_follow_indexes, 10)\n",
"lp_val_follow_indexes = np.concatenate([lp_val_follow_indexes, lp_val_follow_neg_indexes])\n",
"print(f\"Part of val indexes[user:follow:item] for link prediction: {lp_val_follow_indexes[:3]}\")\n",
"np.save(lp_val_follow_indexes_path, lp_val_follow_indexes)\n",
"print(f\"LP val indexes[user:follow:item] are saved to {lp_val_follow_indexes_path}\\n\")\n",
"\n",
"# Test seeds for user:like:item.\n",
"lp_test_like_seeds_path = os.path.join(base_dir, \"lp-test-like-seeds.npy\")\n",
"lp_test_like_seeds = like_edges[-num_tests:, :]\n",
"lp_test_like_neg_dsts = np.random.randint(0, num_nodes, (num_tests, 10)).reshape(-1)\n",
"lp_test_like_neg_srcs = np.repeat(lp_test_like_seeds[:,0], 10)\n",
"lp_test_like_neg_seeds = np.concatenate((lp_test_like_neg_srcs, lp_test_like_neg_dsts)).reshape(2,-1).T\n",
"lp_test_like_seeds = np.concatenate((lp_test_like_seeds, lp_test_like_neg_seeds))\n",
"print(f\"Part of test seeds[user:like:item] for link prediction: {lp_test_like_seeds[:3]}\")\n",
"np.save(lp_test_like_seeds_path, lp_test_like_seeds)\n",
"print(f\"LP test seeds[user:like:item] are saved to {lp_test_like_seeds_path}\\n\")\n",
"\n",
"# Test labels for user:like:item.\n",
"lp_test_like_labels_path = os.path.join(base_dir, \"lp-test-like-labels.npy\")\n",
"lp_test_like_labels = np.empty(num_tests * (10 + 1))\n",
"lp_test_like_labels[:num_tests] = 1\n",
"lp_test_like_labels[num_tests:] = 0\n",
"print(f\"Part of test labels[user:like:item] for link prediction: {lp_test_like_labels[:3]}\")\n",
"np.save(lp_test_like_labels_path, lp_test_like_labels)\n",
"print(f\"LP test labels[user:like:item] are saved to {lp_test_like_labels_path}\\n\")\n",
"\n",
"# Test indexes for user:like:item.\n",
"lp_test_like_indexes_path = os.path.join(base_dir, \"lp-test-like-indexes.npy\")\n",
"lp_test_like_indexes = np.arange(0, num_tests)\n",
"lp_test_like_neg_indexes = np.repeat(lp_test_like_indexes, 10)\n",
"lp_test_like_indexes = np.concatenate([lp_test_like_indexes, lp_test_like_neg_indexes])\n",
"print(f\"Part of test indexes[user:like:item] for link prediction: {lp_test_like_indexes[:3]}\")\n",
"np.save(lp_test_like_indexes_path, lp_test_like_indexes)\n",
"print(f\"LP test indexes[user:like:item] are saved to {lp_test_like_indexes_path}\\n\")\n",
"\n",
"# Test seeds for user:follow:item.\n",
"lp_test_follow_seeds_path = os.path.join(base_dir, \"lp-test-follow-seeds.npy\")\n",
"lp_test_follow_seeds = follow_edges[-num_tests:, :]\n",
"lp_test_follow_neg_dsts = np.random.randint(0, num_nodes, (num_tests, 10)).reshape(-1)\n",
"lp_test_follow_neg_srcs = np.repeat(lp_test_follow_seeds[:,0], 10)\n",
"lp_test_follow_neg_seeds = np.concatenate((lp_test_follow_neg_srcs, lp_test_follow_neg_dsts)).reshape(2,-1).T\n",
"lp_test_follow_seeds = np.concatenate((lp_test_follow_seeds, lp_test_follow_neg_seeds))\n",
"print(f\"Part of test seeds[user:follow:item] for link prediction: {lp_test_follow_seeds[:3]}\")\n",
"np.save(lp_test_follow_seeds_path, lp_test_follow_seeds)\n",
"print(f\"LP test seeds[user:follow:item] are saved to {lp_test_follow_seeds_path}\\n\")\n",
"\n",
"# Test labels for user:follow:item.\n",
"lp_test_follow_labels_path = os.path.join(base_dir, \"lp-test-follow-labels.npy\")\n",
"lp_test_follow_labels = np.empty(num_tests * (10 + 1))\n",
"lp_test_follow_labels[:num_tests] = 1\n",
"lp_test_follow_labels[num_tests:] = 0\n",
"print(f\"Part of test labels[user:follow:item] for link prediction: {lp_test_follow_labels[:3]}\")\n",
"np.save(lp_test_follow_labels_path, lp_test_follow_labels)\n",
"print(f\"LP test labels[user:follow:item] are saved to {lp_test_follow_labels_path}\\n\")\n",
"\n",
"# Test indexes for user:follow:item.\n",
"lp_test_follow_indexes_path = os.path.join(base_dir, \"lp-test-follow-indexes.npy\")\n",
"lp_test_follow_indexes = np.arange(0, num_tests)\n",
"lp_test_follow_neg_indexes = np.repeat(lp_test_follow_indexes, 10)\n",
"lp_test_follow_indexes = np.concatenate([lp_test_follow_indexes, lp_test_follow_neg_indexes])\n",
"print(f\"Part of test indexes[user:follow:item] for link prediction: {lp_test_follow_indexes[:3]}\")\n",
"np.save(lp_test_follow_indexes_path, lp_test_follow_indexes)\n",
"print(f\"LP test indexes[user:follow:item] are saved to {lp_test_follow_indexes_path}\\n\")"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "wbk6-wxRK-6S"
},
"source": [
"## Organize Data into YAML File\n",
"Now we need to create a `metadata.yaml` file which contains the paths, dadta types of graph structure, feature data, training/validation/test sets. Please note that all path should be relative to `metadata.yaml`.\n",
......@@ -457,13 +505,15 @@
" - `in_memory`: indicates whether to load dada into memory or `mmap`. Default is `True`.\n",
"\n",
"Please refer to [YAML specification](https://github.com/dmlc/dgl/blob/master/docs/source/stochastic_training/ondisk-dataset-specification.rst) for more details."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"id": "ddGTWW61Lpwp"
},
"outputs": [],
"source": [
"yaml_content = f\"\"\"\n",
" dataset_name: heterogeneous_graph_nc_lp\n",
......@@ -527,7 +577,7 @@
" train_set:\n",
" - type: user\n",
" data:\n",
" - name: seed_nodes\n",
" - name: seeds\n",
" format: numpy\n",
" path: {os.path.basename(nc_train_user_ids_path)}\n",
" - name: labels\n",
......@@ -535,7 +585,7 @@
" path: {os.path.basename(nc_train_user_labels_path)}\n",
" - type: item\n",
" data:\n",
" - name: seed_nodes\n",
" - name: seeds\n",
" format: numpy\n",
" path: {os.path.basename(nc_train_item_ids_path)}\n",
" - name: labels\n",
......@@ -544,7 +594,7 @@
" validation_set:\n",
" - type: user\n",
" data:\n",
" - name: seed_nodes\n",
" - name: seeds\n",
" format: numpy\n",
" path: {os.path.basename(nc_val_user_ids_path)}\n",
" - name: labels\n",
......@@ -552,7 +602,7 @@
" path: {os.path.basename(nc_val_user_labels_path)}\n",
" - type: item\n",
" data:\n",
" - name: seed_nodes\n",
" - name: seeds\n",
" format: numpy\n",
" path: {os.path.basename(nc_val_item_ids_path)}\n",
" - name: labels\n",
......@@ -561,7 +611,7 @@
" test_set:\n",
" - type: user\n",
" data:\n",
" - name: seed_nodes\n",
" - name: seeds\n",
" format: numpy\n",
" path: {os.path.basename(nc_test_user_ids_path)}\n",
" - name: labels\n",
......@@ -569,7 +619,7 @@
" path: {os.path.basename(nc_test_user_labels_path)}\n",
" - type: item\n",
" data:\n",
" - name: seed_nodes\n",
" - name: seeds\n",
" format: numpy\n",
" path: {os.path.basename(nc_test_item_ids_path)}\n",
" - name: labels\n",
......@@ -580,61 +630,71 @@
" train_set:\n",
" - type: \"user:like:item\"\n",
" data:\n",
" - name: node_pairs\n",
" - name: seeds\n",
" format: numpy\n",
" path: {os.path.basename(lp_train_like_node_pairs_path)}\n",
" path: {os.path.basename(lp_train_like_seeds_path)}\n",
" - type: \"user:follow:user\"\n",
" data:\n",
" - name: node_pairs\n",
" - name: seeds\n",
" format: numpy\n",
" path: {os.path.basename(lp_train_follow_node_pairs_path)}\n",
" path: {os.path.basename(lp_train_follow_seeds_path)}\n",
" validation_set:\n",
" - type: \"user:like:item\"\n",
" data:\n",
" - name: node_pairs\n",
" - name: seeds\n",
" format: numpy\n",
" path: {os.path.basename(lp_val_like_node_pairs_path)}\n",
" - name: negative_dsts\n",
" format: torch\n",
" path: {os.path.basename(lp_val_like_neg_dsts_path)}\n",
" path: {os.path.basename(lp_val_like_seeds_path)}\n",
" - name: labels\n",
" format: numpy\n",
" path: {os.path.basename(lp_val_like_labels_path)}\n",
" - name: indexes\n",
" format: numpy\n",
" path: {os.path.basename(lp_val_like_indexes_path)}\n",
" - type: \"user:follow:user\"\n",
" data:\n",
" - name: node_pairs\n",
" - name: seeds\n",
" format: numpy\n",
" path: {os.path.basename(lp_val_follow_node_pairs_path)}\n",
" - name: negative_dsts\n",
" format: torch\n",
" path: {os.path.basename(lp_val_follow_neg_dsts_path)}\n",
" path: {os.path.basename(lp_val_follow_seeds_path)}\n",
" - name: labels\n",
" format: numpy\n",
" path: {os.path.basename(lp_val_follow_labels_path)}\n",
" - name: indexes\n",
" format: numpy\n",
" path: {os.path.basename(lp_val_follow_indexes_path)}\n",
" test_set:\n",
" - type: \"user:like:item\"\n",
" data:\n",
" - name: node_pairs\n",
" - name: seeds\n",
" format: numpy\n",
" path: {os.path.basename(lp_test_like_node_pairs_path)}\n",
" - name: negative_dsts\n",
" format: torch\n",
" path: {os.path.basename(lp_test_like_neg_dsts_path)}\n",
" path: {os.path.basename(lp_test_like_seeds_path)}\n",
" - name: labels\n",
" format: numpy\n",
" path: {os.path.basename(lp_test_like_labels_path)}\n",
" - name: indexes\n",
" format: numpy\n",
" path: {os.path.basename(lp_test_like_indexes_path)}\n",
" - type: \"user:follow:user\"\n",
" data:\n",
" - name: node_pairs\n",
" - name: seeds\n",
" format: numpy\n",
" path: {os.path.basename(lp_test_follow_node_pairs_path)}\n",
" - name: negative_dsts\n",
" format: torch\n",
" path: {os.path.basename(lp_test_follow_neg_dsts_path)}\n",
" path: {os.path.basename(lp_test_follow_seeds_path)}\n",
" - name: labels\n",
" format: numpy\n",
" path: {os.path.basename(lp_test_follow_labels_path)}\n",
" - name: indexes\n",
" format: numpy\n",
" path: {os.path.basename(lp_test_follow_indexes_path)}\n",
"\"\"\"\n",
"metadata_path = os.path.join(base_dir, \"metadata.yaml\")\n",
"with open(metadata_path, \"w\") as f:\n",
" f.write(yaml_content)"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "kEfybHGhOW7O"
},
"source": [
"## Instantiate `OnDiskDataset`\n",
"Now we're ready to load dataset via `dgl.graphbolt.OnDiskDataset`. When instantiating, we just pass in the base directory where `metadata.yaml` file lies.\n",
......@@ -642,13 +702,15 @@
"During first instantiation, GraphBolt preprocesses the raw data such as constructing `FusedCSCSamplingGraph` from edges. All data including graph, feature data, training/validation/test sets are put into `preprocessed` directory after preprocessing. Any following dataset loading will skip the preprocess stage.\n",
"\n",
"After preprocessing, `load()` is required to be called explicitly in order to load graph, feature data and tasks."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"id": "W58CZoSzOiyo"
},
"outputs": [],
"source": [
"dataset = gb.OnDiskDataset(base_dir).load()\n",
"graph = dataset.graph\n",
......@@ -662,12 +724,31 @@
"print(f\"Loaded node classification task: {nc_task}\\n\")\n",
"lp_task = tasks[1]\n",
"print(f\"Loaded link prediction task: {lp_task}\\n\")"
]
}
],
"metadata": {
"id": "W58CZoSzOiyo"
"colab": {
"private_outputs": true,
"provenance": []
},
"execution_count": null,
"outputs": []
"kernelspec": {
"display_name": "Python 3",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.10.12"
}
},
"nbformat": 4,
"nbformat_minor": 0
}
{
"nbformat": 4,
"nbformat_minor": 0,
"metadata": {
"colab": {
"private_outputs": true,
"provenance": []
},
"kernelspec": {
"name": "python3",
"display_name": "Python 3"
},
"language_info": {
"name": "python"
}
},
"cells": [
{
"cell_type": "markdown",
"metadata": {
"id": "FnFhPMaAfLtJ"
},
"source": [
"# OnDiskDataset for Homogeneous Graph\n",
"\n",
......@@ -33,22 +21,24 @@
"To create an ``OnDiskDataset`` object, you need to organize all the data including graph structure, feature data and tasks into a directory. The directory should contain a ``metadata.yaml`` file that describes the metadata of the dataset.\n",
"\n",
"Now let's generate various data step by step and organize them together to instantiate `OnDiskDataset` finally."
]
},
{
"cell_type": "markdown",
"source": [
"## Install DGL package"
],
"metadata": {
"id": "Wlb19DtWgtzq"
}
},
"source": [
"## Install DGL package"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"id": "UojlT9ZGgyr9"
},
"outputs": [],
"source": [
"# Install required packages.\n",
"import os\n",
......@@ -69,52 +59,52 @@
" installed = False\n",
" print(error)\n",
"print(\"DGL installed!\" if installed else \"DGL not found!\")"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "2R7WnSbjsfbr"
},
"source": [
"## Data preparation\n",
"In order to demonstrate how to organize various data, let's create a base directory first."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"id": "SZipbzyltLfO"
},
"outputs": [],
"source": [
"base_dir = './ondisk_dataset_homograph'\n",
"os.makedirs(base_dir, exist_ok=True)\n",
"print(f\"Created base directory: {base_dir}\")"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "qhNtIn_xhlnl"
},
"source": [
"### Generate graph structure data\n",
"For homogeneous graph, we just need to save edges(namely node pairs) into **Numpy** or **CSV** file.\n",
"For homogeneous graph, we just need to save edges(namely seeds) into **Numpy** or **CSV** file.\n",
"\n",
"Note:\n",
"- when saving to **Numpy**, the array requires to be in shape of `(2, N)`. This format is recommended as constructing graph from it is much faster than **CSV** file.\n",
"- when saving to **CSV** file, do not save index and header.\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"id": "HcBt4G5BmSjr"
},
"outputs": [],
"source": [
"import numpy as np\n",
"import pandas as pd\n",
......@@ -129,25 +119,25 @@
"df.to_csv(edges_path, index=False, header=False)\n",
"\n",
"print(f\"Edges are saved into {edges_path}\")"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "kh-4cPtzpcaH"
},
"source": [
"### Generate feature data for graph\n",
"For feature data, numpy arrays and torch tensors are supported for now."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"id": "_PVu1u5brBhF"
},
"outputs": [],
"source": [
"# Generate node feature in numpy array.\n",
"node_feat_0_path = os.path.join(base_dir, \"node-feat-0.npy\")\n",
......@@ -176,35 +166,35 @@
"print(f\"Part of edge feature [feat_1]: {edge_feat_1[:3, :]}\")\n",
"torch.save(edge_feat_1, edge_feat_1_path)\n",
"print(f\"Edge feature [feat_1] is saved to {edge_feat_1_path}\\n\")\n"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "ZyqgOtsIwzh_"
},
"source": [
"### Generate tasks\n",
"`OnDiskDataset` supports multiple tasks. For each task, we need to prepare training/validation/test sets respectively. Such sets usually vary among different tasks. In this tutorial, let's create a **Node Classification** task and **Link Prediction** task."
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "hVxHaDIfzCkr"
},
"source": [
"#### Node Classification Task\n",
"For node classification task, we need **node IDs** and corresponding **labels** for each training/validation/test set. Like feature data, numpy arrays and torch tensors are supported for these sets."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"id": "S5-fyBbHzTCO"
},
"outputs": [],
"source": [
"num_trains = int(num_nodes * 0.6)\n",
"num_vals = int(num_nodes * 0.2)\n",
......@@ -248,68 +238,94 @@
"print(f\"Part of test labels for node classification: {nc_test_labels[:3]}\")\n",
"torch.save(nc_test_labels, nc_test_labels_path)\n",
"print(f\"NC test labels are saved to {nc_test_labels_path}\\n\")"
]
},
{
"cell_type": "markdown",
"source": [
"#### Link Prediction Task\n",
"For link prediction task, we need **node pairs** or **negative src/dsts** for each training/validation/test set. Like feature data, numpy arrays and torch tensors are supported for these sets."
],
"metadata": {
"id": "LhAcDCHQ_KJ0"
}
},
"source": [
"#### Link Prediction Task\n",
"For link prediction task, we need **seeds** or **corresponding labels and indexes** which representing the pos/neg property and group of the seeds for each training/validation/test set. Like feature data, numpy arrays and torch tensors are supported for these sets."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"id": "u0jCnXIcAQy4"
},
"outputs": [],
"source": [
"num_trains = int(num_edges * 0.6)\n",
"num_vals = int(num_edges * 0.2)\n",
"num_tests = num_edges - num_trains - num_vals\n",
"\n",
"lp_train_node_pairs_path = os.path.join(base_dir, \"lp-train-node-pairs.npy\")\n",
"lp_train_node_pairs = edges[:num_trains, :]\n",
"print(f\"Part of train node pairs for link prediction: {lp_train_node_pairs[:3]}\")\n",
"np.save(lp_train_node_pairs_path, lp_train_node_pairs)\n",
"print(f\"LP train node pairs are saved to {lp_train_node_pairs_path}\\n\")\n",
"\n",
"lp_val_node_pairs_path = os.path.join(base_dir, \"lp-val-node-pairs.npy\")\n",
"lp_val_node_pairs = edges[num_trains:num_trains+num_vals, :]\n",
"print(f\"Part of val node pairs for link prediction: {lp_val_node_pairs[:3]}\")\n",
"np.save(lp_val_node_pairs_path, lp_val_node_pairs)\n",
"print(f\"LP val node pairs are saved to {lp_val_node_pairs_path}\\n\")\n",
"\n",
"lp_val_neg_dsts_path = os.path.join(base_dir, \"lp-val-neg-dsts.pt\")\n",
"lp_val_neg_dsts = torch.randint(0, num_nodes, (num_vals, 10))\n",
"print(f\"Part of val negative dsts for link prediction: {lp_val_neg_dsts[:3]}\")\n",
"torch.save(lp_val_neg_dsts, lp_val_neg_dsts_path)\n",
"print(f\"LP val negative dsts are saved to {lp_val_neg_dsts_path}\\n\")\n",
"\n",
"lp_test_node_pairs_path = os.path.join(base_dir, \"lp-test-node-pairs.npy\")\n",
"lp_test_node_pairs = edges[-num_tests:, :]\n",
"print(f\"Part of test node pairs for link prediction: {lp_test_node_pairs[:3]}\")\n",
"np.save(lp_test_node_pairs_path, lp_test_node_pairs)\n",
"print(f\"LP test node pairs are saved to {lp_test_node_pairs_path}\\n\")\n",
"\n",
"lp_test_neg_dsts_path = os.path.join(base_dir, \"lp-test-neg-dsts.pt\")\n",
"lp_test_neg_dsts = torch.randint(0, num_nodes, (num_tests, 10))\n",
"print(f\"Part of test negative dsts for link prediction: {lp_test_neg_dsts[:3]}\")\n",
"torch.save(lp_test_neg_dsts, lp_test_neg_dsts_path)\n",
"print(f\"LP test negative dsts are saved to {lp_test_neg_dsts_path}\\n\")"
],
"metadata": {
"id": "u0jCnXIcAQy4"
},
"execution_count": null,
"outputs": []
"lp_train_seeds_path = os.path.join(base_dir, \"lp-train-seeds.npy\")\n",
"lp_train_seeds = edges[:num_trains, :]\n",
"print(f\"Part of train seeds for link prediction: {lp_train_seeds[:3]}\")\n",
"np.save(lp_train_seeds_path, lp_train_seeds)\n",
"print(f\"LP train seeds are saved to {lp_train_seeds_path}\\n\")\n",
"\n",
"lp_val_seeds_path = os.path.join(base_dir, \"lp-val-seeds.npy\")\n",
"lp_val_seeds = edges[num_trains:num_trains+num_vals, :]\n",
"lp_val_neg_dsts = np.random.randint(0, num_nodes, (num_vals, 10)).reshape(-1)\n",
"lp_val_neg_srcs = np.repeat(lp_val_seeds[:,0], 10)\n",
"lp_val_neg_seeds = np.concatenate((lp_val_neg_srcs, lp_val_neg_dsts)).reshape(2,-1).T\n",
"lp_val_seeds = np.concatenate((lp_val_seeds, lp_val_neg_seeds))\n",
"print(f\"Part of val seeds for link prediction: {lp_val_seeds[:3]}\")\n",
"np.save(lp_val_seeds_path, lp_val_seeds)\n",
"print(f\"LP val seeds are saved to {lp_val_seeds_path}\\n\")\n",
"\n",
"lp_val_labels_path = os.path.join(base_dir, \"lp-val-labels.npy\")\n",
"lp_val_labels = np.empty(num_vals * (10 + 1))\n",
"lp_val_labels[:num_vals] = 1\n",
"lp_val_labels[num_vals:] = 0\n",
"print(f\"Part of val labels for link prediction: {lp_val_labels[:3]}\")\n",
"np.save(lp_val_labels_path, lp_val_labels)\n",
"print(f\"LP val labels are saved to {lp_val_labels_path}\\n\")\n",
"\n",
"lp_val_indexes_path = os.path.join(base_dir, \"lp-val-indexes.npy\")\n",
"lp_val_indexes = np.arange(0, num_vals)\n",
"lp_val_neg_indexes = np.repeat(lp_val_indexes, 10)\n",
"lp_val_indexes = np.concatenate([lp_val_indexes, lp_val_neg_indexes])\n",
"print(f\"Part of val indexes for link prediction: {lp_val_indexes[:3]}\")\n",
"np.save(lp_val_indexes_path, lp_val_indexes)\n",
"print(f\"LP val indexes are saved to {lp_val_indexes_path}\\n\")\n",
"\n",
"lp_test_seeds_path = os.path.join(base_dir, \"lp-test-seeds.npy\")\n",
"lp_test_seeds = edges[-num_tests:, :]\n",
"lp_test_neg_dsts = np.random.randint(0, num_nodes, (num_tests, 10)).reshape(-1)\n",
"lp_test_neg_srcs = np.repeat(lp_test_seeds[:,0], 10)\n",
"lp_test_neg_seeds = np.concatenate((lp_test_neg_srcs, lp_test_neg_dsts)).reshape(2,-1).T\n",
"lp_test_seeds = np.concatenate((lp_test_seeds, lp_test_neg_seeds))\n",
"print(f\"Part of test seeds for link prediction: {lp_test_seeds[:3]}\")\n",
"np.save(lp_test_seeds_path, lp_test_seeds)\n",
"print(f\"LP test seeds are saved to {lp_test_seeds_path}\\n\")\n",
"\n",
"lp_test_labels_path = os.path.join(base_dir, \"lp-test-labels.npy\")\n",
"lp_test_labels = np.empty(num_tests * (10 + 1))\n",
"lp_test_labels[:num_tests] = 1\n",
"lp_test_labels[num_tests:] = 0\n",
"print(f\"Part of val labels for link prediction: {lp_test_labels[:3]}\")\n",
"np.save(lp_test_labels_path, lp_test_labels)\n",
"print(f\"LP test labels are saved to {lp_test_labels_path}\\n\")\n",
"\n",
"lp_test_indexes_path = os.path.join(base_dir, \"lp-test-indexes.npy\")\n",
"lp_test_indexes = np.arange(0, num_tests)\n",
"lp_test_neg_indexes = np.repeat(lp_test_indexes, 10)\n",
"lp_test_indexes = np.concatenate([lp_test_indexes, lp_test_neg_indexes])\n",
"print(f\"Part of test indexes for link prediction: {lp_test_indexes[:3]}\")\n",
"np.save(lp_test_indexes_path, lp_test_indexes)\n",
"print(f\"LP test indexes are saved to {lp_test_indexes_path}\\n\")"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "wbk6-wxRK-6S"
},
"source": [
"## Organize Data into YAML File\n",
"Now we need to create a `metadata.yaml` file which contains the paths, dadta types of graph structure, feature data, training/validation/test sets.\n",
......@@ -320,13 +336,15 @@
" - `in_memory`: indicates whether to load dada into memory or `mmap`. Default is `True`.\n",
"\n",
"Please refer to [YAML specification](https://github.com/dmlc/dgl/blob/master/docs/source/stochastic_training/ondisk-dataset-specification.rst) for more details."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"id": "ddGTWW61Lpwp"
},
"outputs": [],
"source": [
"yaml_content = f\"\"\"\n",
" dataset_name: homogeneous_graph_nc_lp\n",
......@@ -358,7 +376,7 @@
" num_classes: 10\n",
" train_set:\n",
" - data:\n",
" - name: seed_nodes\n",
" - name: seeds\n",
" format: numpy\n",
" path: {os.path.basename(nc_train_ids_path)}\n",
" - name: labels\n",
......@@ -366,7 +384,7 @@
" path: {os.path.basename(nc_train_labels_path)}\n",
" validation_set:\n",
" - data:\n",
" - name: seed_nodes\n",
" - name: seeds\n",
" format: numpy\n",
" path: {os.path.basename(nc_val_ids_path)}\n",
" - name: labels\n",
......@@ -374,7 +392,7 @@
" path: {os.path.basename(nc_val_labels_path)}\n",
" test_set:\n",
" - data:\n",
" - name: seed_nodes\n",
" - name: seeds\n",
" format: numpy\n",
" path: {os.path.basename(nc_test_ids_path)}\n",
" - name: labels\n",
......@@ -384,38 +402,42 @@
" num_classes: 10\n",
" train_set:\n",
" - data:\n",
" - name: node_pairs\n",
" - name: seeds\n",
" format: numpy\n",
" path: {os.path.basename(lp_train_node_pairs_path)}\n",
" path: {os.path.basename(lp_train_seeds_path)}\n",
" validation_set:\n",
" - data:\n",
" - name: node_pairs\n",
" - name: seeds\n",
" format: numpy\n",
" path: {os.path.basename(lp_val_node_pairs_path)}\n",
" - name: negative_dsts\n",
" format: torch\n",
" path: {os.path.basename(lp_val_neg_dsts_path)}\n",
" path: {os.path.basename(lp_val_seeds_path)}\n",
" - name: labels\n",
" format: numpy\n",
" path: {os.path.basename(lp_val_labels_path)}\n",
" - name: indexes\n",
" format: numpy\n",
" path: {os.path.basename(lp_val_indexes_path)}\n",
" test_set:\n",
" - data:\n",
" - name: node_pairs\n",
" - name: seeds\n",
" format: numpy\n",
" path: {os.path.basename(lp_test_node_pairs_path)}\n",
" - name: negative_dsts\n",
" format: torch\n",
" path: {os.path.basename(lp_test_neg_dsts_path)}\n",
" path: {os.path.basename(lp_test_seeds_path)}\n",
" - name: labels\n",
" format: numpy\n",
" path: {os.path.basename(lp_test_labels_path)}\n",
" - name: indexes\n",
" format: numpy\n",
" path: {os.path.basename(lp_test_indexes_path)}\n",
"\"\"\"\n",
"metadata_path = os.path.join(base_dir, \"metadata.yaml\")\n",
"with open(metadata_path, \"w\") as f:\n",
" f.write(yaml_content)"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "kEfybHGhOW7O"
},
"source": [
"## Instantiate `OnDiskDataset`\n",
"Now we're ready to load dataset via `dgl.graphbolt.OnDiskDataset`. When instantiating, we just pass in the base directory where `metadata.yaml` file lies.\n",
......@@ -423,13 +445,15 @@
"During first instantiation, GraphBolt preprocesses the raw data such as constructing `FusedCSCSamplingGraph` from edges. All data including graph, feature data, training/validation/test sets are put into `preprocessed` directory after preprocessing. Any following dataset loading will skip the preprocess stage.\n",
"\n",
"After preprocessing, `load()` is required to be called explicitly in order to load graph, feature data and tasks."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"id": "W58CZoSzOiyo"
},
"outputs": [],
"source": [
"dataset = gb.OnDiskDataset(base_dir).load()\n",
"graph = dataset.graph\n",
......@@ -443,12 +467,31 @@
"print(f\"Loaded node classification task: {nc_task}\\n\")\n",
"lp_task = tasks[1]\n",
"print(f\"Loaded link prediction task: {lp_task}\\n\")"
]
}
],
"metadata": {
"id": "W58CZoSzOiyo"
"colab": {
"private_outputs": true,
"provenance": []
},
"execution_count": null,
"outputs": []
"kernelspec": {
"display_name": "Python 3",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.10.12"
}
},
"nbformat": 4,
"nbformat_minor": 0
}