Unverified commit fda8fd22 authored by Rhett Ying, committed by GitHub

[GraphBolt] update tutorial for OnDiskDataset homograph (#6776)

parent 11bdd6e8
@@ -6,6 +6,11 @@ Creating OnDiskDataset
This tutorial shows how to create an `OnDiskDataset` from raw data and use it
for stochastic training.
**GraphBolt** provides the ``OnDiskDataset`` class to help users organize plain
data of graph structure, feature data and tasks. ``OnDiskDataset`` is also
designed to efficiently handle large graphs and features that do not fit into
memory by storing them on disk.
For more details about `OnDiskDataset`, please refer to the
:class:`~dgl.graphbolt.OnDiskDataset` API documentation.
@@ -4,8 +4,7 @@
"metadata": {
"colab": {
"private_outputs": true,
"provenance": [],
"authorship_tag": "ABX9TyMnOgpk68ZvpOQVFBgDxDof"
"provenance": []
},
"kernelspec": {
"name": "python3",
@@ -28,7 +27,11 @@
"By the end of this tutorial, you will be able to\n",
"- organize graph structure data.\n",
"- organize feature data.\n",
"- organize training/validation/test set for specific tasks."
"- organize training/validation/test set for specific tasks.\n",
"\n",
"To create an ``OnDiskDataset`` object, you need to organize all the data including graph structure, feature data and tasks into a directory. The directory should contain a ``metadata.yaml`` file that describes the metadata of the dataset.\n",
"\n",
"Now let's generate various data step by step and organize them together to instantiate `OnDiskDataset` finally."
],
"metadata": {
"id": "FnFhPMaAfLtJ"
@@ -71,6 +74,387 @@
},
"execution_count": null,
"outputs": []
},
{
"cell_type": "markdown",
"source": [
"## Data preparation\n",
"In order to demonstrate how to organize various data, let's create a base directory first."
],
"metadata": {
"id": "2R7WnSbjsfbr"
}
},
{
"cell_type": "code",
"source": [
"base_dir = './ondisk_dataset_homograph'\n",
"os.makedirs(base_dir, exist_ok=True)\n",
"print(f\"Created base directory: {base_dir}\")"
],
"metadata": {
"id": "SZipbzyltLfO"
},
"execution_count": null,
"outputs": []
},
{
"cell_type": "markdown",
"source": [
"### Generate graph structure data\n",
"For homogeneous graph, we just need to save edges(namely node pairs) into **CSV** file.\n",
"\n",
"Note:\n",
"when saving to file, do not save index and header.*italicized text*\n"
],
"metadata": {
"id": "qhNtIn_xhlnl"
}
},
{
"cell_type": "code",
"source": [
"import numpy as np\n",
"import pandas as pd\n",
"num_nodes = 1000\n",
"num_edges = 10 * num_nodes\n",
"edges_path = os.path.join(base_dir, \"edges.csv\")\n",
"edges = np.random.randint(0, num_nodes, size=(num_edges, 2))\n",
"\n",
"print(f\"Part of edges: {edges[:10, :]}\")\n",
"\n",
"df = pd.DataFrame(edges)\n",
"df.to_csv(edges_path, index=False, header=False)\n",
"\n",
"print(f\"Edges are saved into {edges_path}\")"
],
"metadata": {
"id": "HcBt4G5BmSjr"
},
"execution_count": null,
"outputs": []
},
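{
"cell_type": "markdown",
"source": [
"As an optional sanity check (not required by `OnDiskDataset`), we can read the **CSV** file back with `pandas` to confirm it contains only the two node-ID columns, with no index or header. This is a minimal sketch and can be skipped safely."
],
"metadata": {}
},
{
"cell_type": "code",
"source": [
"# Sanity check: read the CSV back without a header to verify the format.\n",
"edges_check = pd.read_csv(edges_path, header=None)\n",
"print(f\"Shape of loaded edges: {edges_check.shape}\")  # Expected: (num_edges, 2).\n",
"print(edges_check.head())"
],
"metadata": {},
"execution_count": null,
"outputs": []
},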
{
"cell_type": "markdown",
"source": [
"### Generate feature data for graph\n",
"For feature data, numpy arrays and torch tensors are supported for now."
],
"metadata": {
"id": "kh-4cPtzpcaH"
}
},
{
"cell_type": "code",
"source": [
"# Generate node feature in numpy array.\n",
"node_feat_0_path = os.path.join(base_dir, \"node-feat-0.npy\")\n",
"node_feat_0 = np.random.rand(num_nodes, 5)\n",
"print(f\"Part of node feature [feat_0]: {node_feat_0[:10, :]}\")\n",
"np.save(node_feat_0_path, node_feat_0)\n",
"print(f\"Node feature [feat_0] is saved to {node_feat_0_path}\")\n",
"\n",
"# Generate another node feature in torch tensor\n",
"node_feat_1_path = os.path.join(base_dir, \"node-feat-1.pt\")\n",
"node_feat_1 = torch.rand(num_nodes, 5)\n",
"print(f\"Part of node feature [feat_1]: {node_feat_1[:10, :]}\")\n",
"torch.save(node_feat_1, node_feat_1_path)\n",
"print(f\"Node feature [feat_1] is saved to {node_feat_1_path}\")\n",
"\n",
"# Generate edge feature in numpy array.\n",
"edge_feat_0_path = os.path.join(base_dir, \"edge-feat-0.npy\")\n",
"edge_feat_0 = np.random.rand(num_edges, 5)\n",
"print(f\"Part of edge feature [feat_0]: {edge_feat_0[:10, :]}\")\n",
"np.save(edge_feat_0_path, edge_feat_0)\n",
"print(f\"Edge feature [feat_0] is saved to {edge_feat_0_path}\")\n",
"\n",
"# Generate another edge feature in torch tensor\n",
"edge_feat_1_path = os.path.join(base_dir, \"edge-feat-1.pt\")\n",
"edge_feat_1 = torch.rand(num_edges, 5)\n",
"print(f\"Part of edge feature [feat_1]: {edge_feat_1[:10, :]}\")\n",
"torch.save(edge_feat_1, edge_feat_1_path)\n",
"print(f\"Edge feature [feat_1] is saved to {edge_feat_1_path}\")\n"
],
"metadata": {
"id": "_PVu1u5brBhF"
},
"execution_count": null,
"outputs": []
},
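{
"cell_type": "markdown",
"source": [
"Optionally, we can load the saved feature files back to verify their shapes. This is just a sanity check; `OnDiskDataset` itself loads them according to the `format` field in `metadata.yaml`."
],
"metadata": {}
},
{
"cell_type": "code",
"source": [
"# Sanity check: load the saved features back and verify their shapes.\n",
"assert np.load(node_feat_0_path).shape == (num_nodes, 5)\n",
"assert torch.load(node_feat_1_path).shape == (num_nodes, 5)\n",
"assert np.load(edge_feat_0_path).shape == (num_edges, 5)\n",
"assert torch.load(edge_feat_1_path).shape == (num_edges, 5)\n",
"print(\"All saved features have the expected shapes.\")"
],
"metadata": {},
"execution_count": null,
"outputs": []
},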
{
"cell_type": "markdown",
"source": [
"### Generate tasks\n",
"`OnDiskDataset` supports multiple tasks. For each task, we need to prepare training/validation/test sets respectively. Such sets usually vary among different tasks. In this tutorial, let's create a **Node Classification** task and **Link Prediction** task."
],
"metadata": {
"id": "ZyqgOtsIwzh_"
}
},
{
"cell_type": "markdown",
"source": [
"#### Node Classification Task\n",
"For node classification task, we need **node IDs** and corresponding **labels** for each training/validation/test set. Like feature data, numpy arrays and torch tensors are supported for these sets."
],
"metadata": {
"id": "hVxHaDIfzCkr"
}
},
{
"cell_type": "code",
"source": [
"num_trains = int(num_nodes * 0.6)\n",
"num_vals = int(num_nodes * 0.2)\n",
"num_tests = num_nodes - num_trains - num_vals\n",
"\n",
"ids = np.arange(num_nodes)\n",
"np.random.shuffle(ids)\n",
"\n",
"nc_train_ids_path = os.path.join(base_dir, \"nc-train-ids.npy\")\n",
"nc_train_ids = ids[:num_trains]\n",
"print(f\"Part of train ids for node classification: {nc_train_ids[:10]}\")\n",
"np.save(nc_train_ids_path, nc_train_ids)\n",
"print(f\"NC train ids are saved to {nc_train_ids_path}\")\n",
"\n",
"nc_train_labels_path = os.path.join(base_dir, \"nc-train-labels.pt\")\n",
"nc_train_labels = torch.randint(0, 10, (num_trains,))\n",
"print(f\"Part of train labels for node classification: {nc_train_labels[:10]}\")\n",
"torch.save(nc_train_labels, nc_train_labels_path)\n",
"print(f\"NC train labels are saved to {nc_train_labels_path}\")\n",
"\n",
"nc_val_ids_path = os.path.join(base_dir, \"nc-val-ids.npy\")\n",
"nc_val_ids = ids[num_trains:num_trains+num_vals]\n",
"print(f\"Part of val ids for node classification: {nc_val_ids[:10]}\")\n",
"np.save(nc_val_ids_path, nc_val_ids)\n",
"print(f\"NC val ids are saved to {nc_val_ids_path}\")\n",
"\n",
"nc_val_labels_path = os.path.join(base_dir, \"nc-val-labels.pt\")\n",
"nc_val_labels = torch.randint(0, 10, (num_vals,))\n",
"print(f\"Part of val labels for node classification: {nc_val_labels[:10]}\")\n",
"torch.save(nc_val_labels, nc_val_labels_path)\n",
"print(f\"NC val labels are saved to {nc_val_labels_path}\")\n",
"\n",
"nc_test_ids_path = os.path.join(base_dir, \"nc-test-ids.npy\")\n",
"nc_test_ids = ids[-num_tests:]\n",
"print(f\"Part of test ids for node classification: {nc_test_ids[:10]}\")\n",
"np.save(nc_test_ids_path, nc_test_ids)\n",
"print(f\"NC test ids are saved to {nc_test_ids_path}\")\n",
"\n",
"nc_test_labels_path = os.path.join(base_dir, \"nc-test-labels.pt\")\n",
"nc_test_labels = torch.randint(0, 10, (num_tests,))\n",
"print(f\"Part of test labels for node classification: {nc_test_labels[:10]}\")\n",
"torch.save(nc_test_labels, nc_test_labels_path)\n",
"print(f\"NC test labels are saved to {nc_test_labels_path}\")"
],
"metadata": {
"id": "S5-fyBbHzTCO"
},
"execution_count": null,
"outputs": []
},
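{
"cell_type": "markdown",
"source": [
"Optionally, let's verify that the three ID sets are disjoint and together cover all nodes. This is a sanity check on the split logic above, not something `OnDiskDataset` requires."
],
"metadata": {}
},
{
"cell_type": "code",
"source": [
"# Sanity check: the splits are disjoint and cover all node IDs.\n",
"all_ids = np.concatenate([nc_train_ids, nc_val_ids, nc_test_ids])\n",
"assert len(all_ids) == num_nodes\n",
"assert len(np.unique(all_ids)) == num_nodes\n",
"print(\"Train/val/test IDs are disjoint and cover all nodes.\")"
],
"metadata": {},
"execution_count": null,
"outputs": []
},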
{
"cell_type": "markdown",
"source": [
"#### Link Prediction Task\n",
"For link prediction task, we need **node pairs** or **negative src/dsts** for each training/validation/test set. Like feature data, numpy arrays and torch tensors are supported for these sets."
],
"metadata": {
"id": "LhAcDCHQ_KJ0"
}
},
{
"cell_type": "code",
"source": [
"num_trains = int(num_edges * 0.6)\n",
"num_vals = int(num_edges * 0.2)\n",
"num_tests = num_edges - num_trains - num_vals\n",
"\n",
"lp_train_node_pairs_path = os.path.join(base_dir, \"lp-train-node-pairs.npy\")\n",
"lp_train_node_pairs = edges[:num_trains, :]\n",
"print(f\"Part of train node pairs for link prediction: {lp_train_node_pairs[:10]}\")\n",
"np.save(lp_train_node_pairs_path, lp_train_node_pairs)\n",
"print(f\"LP train node pairs are saved to {lp_train_node_pairs_path}\")\n",
"\n",
"lp_val_node_pairs_path = os.path.join(base_dir, \"lp-val-node-pairs.npy\")\n",
"lp_val_node_pairs = edges[num_trains:num_trains+num_vals, :]\n",
"print(f\"Part of val node pairs for link prediction: {lp_val_node_pairs[:10]}\")\n",
"np.save(lp_val_node_pairs_path, lp_val_node_pairs)\n",
"print(f\"LP val node pairs are saved to {lp_val_node_pairs_path}\")\n",
"\n",
"lp_val_neg_dsts_path = os.path.join(base_dir, \"lp-val-neg-dsts.pt\")\n",
"lp_val_neg_dsts = torch.randint(0, num_nodes, (num_vals, 10))\n",
"print(f\"Part of val negative dsts for link prediction: {lp_val_neg_dsts[:10]}\")\n",
"torch.save(lp_val_neg_dsts, lp_val_neg_dsts_path)\n",
"print(f\"LP val negative dsts are saved to {lp_val_neg_dsts_path}\")\n",
"\n",
"lp_test_node_pairs_path = os.path.join(base_dir, \"lp-test-node-pairs.npy\")\n",
"lp_test_node_pairs = edges[-num_tests, :]\n",
"print(f\"Part of test node pairs for link prediction: {lp_test_node_pairs[:10]}\")\n",
"np.save(lp_test_node_pairs_path, lp_test_node_pairs)\n",
"print(f\"LP test node pairs are saved to {lp_test_node_pairs_path}\")\n",
"\n",
"lp_test_neg_dsts_path = os.path.join(base_dir, \"lp-test-neg-dsts.pt\")\n",
"lp_test_neg_dsts = torch.randint(0, num_nodes, (num_tests, 10))\n",
"print(f\"Part of test negative dsts for link prediction: {lp_test_neg_dsts[:10]}\")\n",
"torch.save(lp_test_neg_dsts, lp_test_neg_dsts_path)\n",
"print(f\"LP test negative dsts are saved to {lp_test_neg_dsts_path}\")"
],
"metadata": {
"id": "u0jCnXIcAQy4"
},
"execution_count": null,
"outputs": []
},
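{
"cell_type": "markdown",
"source": [
"Again as an optional sanity check, let's confirm the shapes of the saved sets: node pairs should be 2-column arrays, and each validation/test pair comes with 10 negative destinations."
],
"metadata": {}
},
{
"cell_type": "code",
"source": [
"# Sanity check: shapes of the link prediction sets.\n",
"assert lp_train_node_pairs.shape == (num_trains, 2)\n",
"assert lp_val_node_pairs.shape == (num_vals, 2)\n",
"assert lp_test_node_pairs.shape == (num_tests, 2)\n",
"assert lp_val_neg_dsts.shape == (num_vals, 10)\n",
"assert lp_test_neg_dsts.shape == (num_tests, 10)\n",
"print(\"All link prediction sets have the expected shapes.\")"
],
"metadata": {},
"execution_count": null,
"outputs": []
},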
{
"cell_type": "markdown",
"source": [
"## Organize Data into YAML File\n",
"Now we need to create a `metadata.yaml` file which contains the paths, dadta types of graph structure, feature data, training/validation/test sets. Please note that all path should be relative to `metadata.yaml`."
],
"metadata": {
"id": "wbk6-wxRK-6S"
}
},
{
"cell_type": "code",
"source": [
"yaml_content = f\"\"\"\n",
" dataset_name: homogeneous_graph_nc_lp\n",
" graph:\n",
" nodes:\n",
" - num: {num_nodes}\n",
" edges:\n",
" - format: csv\n",
" path: {os.path.basename(edges_path)}\n",
" feature_data:\n",
" - domain: node\n",
" name: feat_0\n",
" format: numpy\n",
" in_memory: true\n",
" path: {os.path.basename(node_feat_0_path)}\n",
" - domain: node\n",
" name: feat_1\n",
" format: torch\n",
" in_memory: true\n",
" path: {os.path.basename(node_feat_1_path)}\n",
" - domain: edge\n",
" name: feat_0\n",
" format: numpy\n",
" in_memory: true\n",
" path: {os.path.basename(edge_feat_0_path)}\n",
" - domain: edge\n",
" name: feat_1\n",
" format: torch\n",
" in_memory: true\n",
" path: {os.path.basename(edge_feat_1_path)}\n",
" tasks:\n",
" - name: node_classification\n",
" num_classes: 10\n",
" train_set:\n",
" - data:\n",
" - name: seed_nodes\n",
" format: numpy\n",
" in_memory: true\n",
" path: {os.path.basename(nc_train_ids_path)}\n",
" - name: labels\n",
" format: torch\n",
" in_memory: true\n",
" path: {os.path.basename(nc_train_labels_path)}\n",
" validation_set:\n",
" - data:\n",
" - name: seed_nodes\n",
" format: numpy\n",
" in_memory: true\n",
" path: {os.path.basename(nc_val_ids_path)}\n",
" - name: labels\n",
" format: torch\n",
" in_memory: true\n",
" path: {os.path.basename(nc_val_labels_path)}\n",
" test_set:\n",
" - data:\n",
" - name: seed_nodes\n",
" format: numpy\n",
" in_memory: true\n",
" path: {os.path.basename(nc_test_ids_path)}\n",
" - name: labels\n",
" format: torch\n",
" in_memory: true\n",
" path: {os.path.basename(nc_test_labels_path)}\n",
" - name: link_prediction\n",
" num_classes: 10\n",
" train_set:\n",
" - data:\n",
" - name: node_pairs\n",
" format: numpy\n",
" in_memory: true\n",
" path: {os.path.basename(lp_train_node_pairs_path)}\n",
" validation_set:\n",
" - data:\n",
" - name: node_pairs\n",
" format: numpy\n",
" in_memory: true\n",
" path: {os.path.basename(lp_val_node_pairs_path)}\n",
" - name: negative_dsts\n",
" format: torch\n",
" in_memory: true\n",
" path: {os.path.basename(lp_val_neg_dsts_path)}\n",
" test_set:\n",
" - data:\n",
" - name: node_pairs\n",
" format: numpy\n",
" in_memory: true\n",
" path: {os.path.basename(lp_test_node_pairs_path)}\n",
" - name: negative_dsts\n",
" format: torch\n",
" in_memory: true\n",
" path: {os.path.basename(lp_test_neg_dsts_path)}\n",
"\"\"\"\n",
"metadata_path = os.path.join(base_dir, \"metadata.yaml\")\n",
"with open(metadata_path, \"w\") as f:\n",
" f.write(yaml_content)"
],
"metadata": {
"id": "ddGTWW61Lpwp"
},
"execution_count": null,
"outputs": []
},
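{
"cell_type": "markdown",
"source": [
"Before instantiating the dataset, we can parse the generated file back as a quick sanity check (assuming the `yaml` package is available, as it is in Colab) to make sure it is well-formed."
],
"metadata": {}
},
{
"cell_type": "code",
"source": [
"import yaml\n",
"\n",
"# Sanity check: the generated metadata.yaml is well-formed YAML.\n",
"with open(metadata_path) as f:\n",
"    metadata = yaml.safe_load(f)\n",
"print(f\"Dataset name: {metadata['dataset_name']}\")\n",
"print(f\"Number of tasks: {len(metadata['tasks'])}\")"
],
"metadata": {},
"execution_count": null,
"outputs": []
},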
{
"cell_type": "markdown",
"source": [
"## Instantiate `OnDiskDataset`\n",
"Now we're ready to load dataset via `dgl.graphbolt.OnDiskDataset`. When instantiating, we just pass in the base directory where `metadata.yaml` file lies.\n",
"\n",
"During first instantiation, GraphBolt preprocesses the raw data such as constructing `FusedCSCSamplingGraph` from edges. All data including graph, feature data, training/validation/test sets are put into `preprocessed` directory after preprocessing. Any following dataset loading will skip the preprocess stage.\n",
"\n",
"After preprocessing, `load()` is required to be called explicitly in order to load graph, feature data and tasks."
],
"metadata": {
"id": "kEfybHGhOW7O"
}
},
{
"cell_type": "code",
"source": [
"dataset = gb.OnDiskDataset(base_dir).load()\n",
"graph = dataset.graph\n",
"print(f\"Loaded graph: {graph}\")\n",
"\n",
"feature = dataset.feature\n",
"print(f\"Loaded feature store: {feature}\")\n",
"\n",
"tasks = dataset.tasks\n",
"nc_task = tasks[0]\n",
"print(f\"Loaded node classification task: {nc_task}\")\n",
"lp_task = tasks[1]\n",
"print(f\"Loaded link prediction task: {lp_task}\")"
],
"metadata": {
"id": "W58CZoSzOiyo"
},
"execution_count": null,
"outputs": []
}
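,
{
"cell_type": "markdown",
"source": [
"To show the loaded objects are usable, here is a brief sketch that reads one node feature back via the feature store and inspects the node classification train set. It assumes the GraphBolt `feature.read(domain, type, name)` signature, with `None` as the type for a homogeneous graph, and the `train_set` attribute on a task."
],
"metadata": {}
},
{
"cell_type": "code",
"source": [
"# A minimal sketch, assuming FeatureStore.read(domain, type, name) and the\n",
"# task's train_set attribute; for a homogeneous graph the type is None.\n",
"feat_0 = feature.read(\"node\", None, \"feat_0\")\n",
"print(f\"Shape of node feature [feat_0]: {feat_0.shape}\")\n",
"\n",
"train_set = nc_task.train_set\n",
"print(f\"Size of NC train set: {len(train_set)}\")"
],
"metadata": {},
"execution_count": null,
"outputs": []
}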
]
}
}
\ No newline at end of file