{ "cells": [ { "cell_type": "markdown", "metadata": { "id": "FnFhPMaAfLtJ" }, "source": [ "# OnDiskDataset for Homogeneous Graph\n", "\n", "[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/dmlc/dgl/blob/master/notebooks/stochastic_training/ondisk_dataset_homograph.ipynb) [![GitHub](https://img.shields.io/badge/-View%20on%20GitHub-181717?logo=github&logoColor=ffffff)](https://github.com/dmlc/dgl/blob/master/notebooks/stochastic_training/ondisk_dataset_homograph.ipynb)\n", "\n", "This tutorial shows how to create `OnDiskDataset` for homogeneous graph that could be used in **GraphBolt** framework.\n", "\n", "By the end of this tutorial, you will be able to\n", "\n", "- organize graph structure data.\n", "- organize feature data.\n", "- organize training/validation/test set for specific tasks.\n", "\n", "To create an ``OnDiskDataset`` object, you need to organize all the data including graph structure, feature data and tasks into a directory. The directory should contain a ``metadata.yaml`` file that describes the metadata of the dataset.\n", "\n", "Now let's generate various data step by step and organize them together to instantiate `OnDiskDataset` finally." ] }, { "cell_type": "markdown", "metadata": { "id": "Wlb19DtWgtzq" }, "source": [ "## Install DGL package" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "UojlT9ZGgyr9" }, "outputs": [], "source": [ "# Install required packages.\n", "import os\n", "import torch\n", "import numpy as np\n", "os.environ['TORCH'] = torch.__version__\n", "os.environ['DGLBACKEND'] = \"pytorch\"\n", "\n", "# Install the CPU version.\n", "device = torch.device(\"cpu\")\n", "!pip install --pre dgl -f https://data.dgl.ai/wheels-test/repo.html\n", "\n", "try:\n", " import dgl\n", " import dgl.graphbolt as gb\n", " installed = True\n", "except ImportError as error:\n", " installed = False\n", " print(error)\n", "print(\"DGL installed!\" if installed else \"DGL not found!\")" ] }, { "cell_type": "markdown", "metadata": { "id": "2R7WnSbjsfbr" }, "source": [ "## Data preparation\n", "In order to demonstrate how to organize various data, let's create a base directory first." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "SZipbzyltLfO" }, "outputs": [], "source": [ "base_dir = './ondisk_dataset_homograph'\n", "os.makedirs(base_dir, exist_ok=True)\n", "print(f\"Created base directory: {base_dir}\")" ] }, { "cell_type": "markdown", "metadata": { "id": "qhNtIn_xhlnl" }, "source": [ "### Generate graph structure data\n", "For homogeneous graph, we just need to save edges(namely seeds) into **Numpy** or **CSV** file.\n", "\n", "Note:\n", "- when saving to **Numpy**, the array requires to be in shape of `(2, N)`. This format is recommended as constructing graph from it is much faster than **CSV** file.\n", "- when saving to **CSV** file, do not save index and header.\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "HcBt4G5BmSjr" }, "outputs": [], "source": [ "import numpy as np\n", "import pandas as pd\n", "num_nodes = 1000\n", "num_edges = 10 * num_nodes\n", "edges_path = os.path.join(base_dir, \"edges.csv\")\n", "edges = np.random.randint(0, num_nodes, size=(num_edges, 2))\n", "\n", "print(f\"Part of edges: {edges[:5, :]}\")\n", "\n", "df = pd.DataFrame(edges)\n", "df.to_csv(edges_path, index=False, header=False)\n", "\n", "print(f\"Edges are saved into {edges_path}\")" ] }, { "cell_type": "markdown", "metadata": { "id": "kh-4cPtzpcaH" }, "source": [ "### Generate feature data for graph\n", "For feature data, numpy arrays and torch tensors are supported for now." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "_PVu1u5brBhF" }, "outputs": [], "source": [ "# Generate node feature in numpy array.\n", "node_feat_0_path = os.path.join(base_dir, \"node-feat-0.npy\")\n", "node_feat_0 = np.random.rand(num_nodes, 5)\n", "print(f\"Part of node feature [feat_0]: {node_feat_0[:3, :]}\")\n", "np.save(node_feat_0_path, node_feat_0)\n", "print(f\"Node feature [feat_0] is saved to {node_feat_0_path}\\n\")\n", "\n", "# Generate another node feature in torch tensor\n", "node_feat_1_path = os.path.join(base_dir, \"node-feat-1.pt\")\n", "node_feat_1 = torch.rand(num_nodes, 5)\n", "print(f\"Part of node feature [feat_1]: {node_feat_1[:3, :]}\")\n", "torch.save(node_feat_1, node_feat_1_path)\n", "print(f\"Node feature [feat_1] is saved to {node_feat_1_path}\\n\")\n", "\n", "# Generate edge feature in numpy array.\n", "edge_feat_0_path = os.path.join(base_dir, \"edge-feat-0.npy\")\n", "edge_feat_0 = np.random.rand(num_edges, 5)\n", "print(f\"Part of edge feature [feat_0]: {edge_feat_0[:3, :]}\")\n", "np.save(edge_feat_0_path, edge_feat_0)\n", "print(f\"Edge feature [feat_0] is saved to {edge_feat_0_path}\\n\")\n", "\n", "# Generate another edge feature in torch tensor\n", "edge_feat_1_path = os.path.join(base_dir, \"edge-feat-1.pt\")\n", "edge_feat_1 = torch.rand(num_edges, 5)\n", "print(f\"Part of edge feature [feat_1]: {edge_feat_1[:3, :]}\")\n", "torch.save(edge_feat_1, edge_feat_1_path)\n", "print(f\"Edge feature [feat_1] is saved to {edge_feat_1_path}\\n\")\n" ] }, { "cell_type": "markdown", "metadata": { "id": "ZyqgOtsIwzh_" }, "source": [ "### Generate tasks\n", "`OnDiskDataset` supports multiple tasks. For each task, we need to prepare training/validation/test sets respectively. Such sets usually vary among different tasks. In this tutorial, let's create a **Node Classification** task and **Link Prediction** task." ] }, { "cell_type": "markdown", "metadata": { "id": "hVxHaDIfzCkr" }, "source": [ "#### Node Classification Task\n", "For node classification task, we need **node IDs** and corresponding **labels** for each training/validation/test set. Like feature data, numpy arrays and torch tensors are supported for these sets." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "S5-fyBbHzTCO" }, "outputs": [], "source": [ "num_trains = int(num_nodes * 0.6)\n", "num_vals = int(num_nodes * 0.2)\n", "num_tests = num_nodes - num_trains - num_vals\n", "\n", "ids = np.arange(num_nodes)\n", "np.random.shuffle(ids)\n", "\n", "nc_train_ids_path = os.path.join(base_dir, \"nc-train-ids.npy\")\n", "nc_train_ids = ids[:num_trains]\n", "print(f\"Part of train ids for node classification: {nc_train_ids[:3]}\")\n", "np.save(nc_train_ids_path, nc_train_ids)\n", "print(f\"NC train ids are saved to {nc_train_ids_path}\\n\")\n", "\n", "nc_train_labels_path = os.path.join(base_dir, \"nc-train-labels.pt\")\n", "nc_train_labels = torch.randint(0, 10, (num_trains,))\n", "print(f\"Part of train labels for node classification: {nc_train_labels[:3]}\")\n", "torch.save(nc_train_labels, nc_train_labels_path)\n", "print(f\"NC train labels are saved to {nc_train_labels_path}\\n\")\n", "\n", "nc_val_ids_path = os.path.join(base_dir, \"nc-val-ids.npy\")\n", "nc_val_ids = ids[num_trains:num_trains+num_vals]\n", "print(f\"Part of val ids for node classification: {nc_val_ids[:3]}\")\n", "np.save(nc_val_ids_path, nc_val_ids)\n", "print(f\"NC val ids are saved to {nc_val_ids_path}\\n\")\n", "\n", "nc_val_labels_path = os.path.join(base_dir, \"nc-val-labels.pt\")\n", "nc_val_labels = torch.randint(0, 10, (num_vals,))\n", "print(f\"Part of val labels for node classification: {nc_val_labels[:3]}\")\n", "torch.save(nc_val_labels, nc_val_labels_path)\n", "print(f\"NC val labels are saved to {nc_val_labels_path}\\n\")\n", "\n", "nc_test_ids_path = os.path.join(base_dir, \"nc-test-ids.npy\")\n", "nc_test_ids = ids[-num_tests:]\n", "print(f\"Part of test ids for node classification: {nc_test_ids[:3]}\")\n", "np.save(nc_test_ids_path, nc_test_ids)\n", "print(f\"NC test ids are saved to {nc_test_ids_path}\\n\")\n", "\n", "nc_test_labels_path = os.path.join(base_dir, \"nc-test-labels.pt\")\n", "nc_test_labels = torch.randint(0, 10, (num_tests,))\n", "print(f\"Part of test labels for node classification: {nc_test_labels[:3]}\")\n", "torch.save(nc_test_labels, nc_test_labels_path)\n", "print(f\"NC test labels are saved to {nc_test_labels_path}\\n\")" ] }, { "cell_type": "markdown", "metadata": { "id": "LhAcDCHQ_KJ0" }, "source": [ "#### Link Prediction Task\n", "For link prediction task, we need **seeds** or **corresponding labels and indexes** which representing the pos/neg property and group of the seeds for each training/validation/test set. Like feature data, numpy arrays and torch tensors are supported for these sets." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "u0jCnXIcAQy4" }, "outputs": [], "source": [ "num_trains = int(num_edges * 0.6)\n", "num_vals = int(num_edges * 0.2)\n", "num_tests = num_edges - num_trains - num_vals\n", "\n", "lp_train_seeds_path = os.path.join(base_dir, \"lp-train-seeds.npy\")\n", "lp_train_seeds = edges[:num_trains, :]\n", "print(f\"Part of train seeds for link prediction: {lp_train_seeds[:3]}\")\n", "np.save(lp_train_seeds_path, lp_train_seeds)\n", "print(f\"LP train seeds are saved to {lp_train_seeds_path}\\n\")\n", "\n", "lp_val_seeds_path = os.path.join(base_dir, \"lp-val-seeds.npy\")\n", "lp_val_seeds = edges[num_trains:num_trains+num_vals, :]\n", "lp_val_neg_dsts = np.random.randint(0, num_nodes, (num_vals, 10)).reshape(-1)\n", "lp_val_neg_srcs = np.repeat(lp_val_seeds[:,0], 10)\n", "lp_val_neg_seeds = np.concatenate((lp_val_neg_srcs, lp_val_neg_dsts)).reshape(2,-1).T\n", "lp_val_seeds = np.concatenate((lp_val_seeds, lp_val_neg_seeds))\n", "print(f\"Part of val seeds for link prediction: {lp_val_seeds[:3]}\")\n", "np.save(lp_val_seeds_path, lp_val_seeds)\n", "print(f\"LP val seeds are saved to {lp_val_seeds_path}\\n\")\n", "\n", "lp_val_labels_path = os.path.join(base_dir, \"lp-val-labels.npy\")\n", "lp_val_labels = np.empty(num_vals * (10 + 1))\n", "lp_val_labels[:num_vals] = 1\n", "lp_val_labels[num_vals:] = 0\n", "print(f\"Part of val labels for link prediction: {lp_val_labels[:3]}\")\n", "np.save(lp_val_labels_path, lp_val_labels)\n", "print(f\"LP val labels are saved to {lp_val_labels_path}\\n\")\n", "\n", "lp_val_indexes_path = os.path.join(base_dir, \"lp-val-indexes.npy\")\n", "lp_val_indexes = np.arange(0, num_vals)\n", "lp_val_neg_indexes = np.repeat(lp_val_indexes, 10)\n", "lp_val_indexes = np.concatenate([lp_val_indexes, lp_val_neg_indexes])\n", "print(f\"Part of val indexes for link prediction: {lp_val_indexes[:3]}\")\n", "np.save(lp_val_indexes_path, lp_val_indexes)\n", "print(f\"LP val indexes are saved to {lp_val_indexes_path}\\n\")\n", "\n", "lp_test_seeds_path = os.path.join(base_dir, \"lp-test-seeds.npy\")\n", "lp_test_seeds = edges[-num_tests:, :]\n", "lp_test_neg_dsts = np.random.randint(0, num_nodes, (num_tests, 10)).reshape(-1)\n", "lp_test_neg_srcs = np.repeat(lp_test_seeds[:,0], 10)\n", "lp_test_neg_seeds = np.concatenate((lp_test_neg_srcs, lp_test_neg_dsts)).reshape(2,-1).T\n", "lp_test_seeds = np.concatenate((lp_test_seeds, lp_test_neg_seeds))\n", "print(f\"Part of test seeds for link prediction: {lp_test_seeds[:3]}\")\n", "np.save(lp_test_seeds_path, lp_test_seeds)\n", "print(f\"LP test seeds are saved to {lp_test_seeds_path}\\n\")\n", "\n", "lp_test_labels_path = os.path.join(base_dir, \"lp-test-labels.npy\")\n", "lp_test_labels = np.empty(num_tests * (10 + 1))\n", "lp_test_labels[:num_tests] = 1\n", "lp_test_labels[num_tests:] = 0\n", "print(f\"Part of val labels for link prediction: {lp_test_labels[:3]}\")\n", "np.save(lp_test_labels_path, lp_test_labels)\n", "print(f\"LP test labels are saved to {lp_test_labels_path}\\n\")\n", "\n", "lp_test_indexes_path = os.path.join(base_dir, \"lp-test-indexes.npy\")\n", "lp_test_indexes = np.arange(0, num_tests)\n", "lp_test_neg_indexes = np.repeat(lp_test_indexes, 10)\n", "lp_test_indexes = np.concatenate([lp_test_indexes, lp_test_neg_indexes])\n", "print(f\"Part of test indexes for link prediction: {lp_test_indexes[:3]}\")\n", "np.save(lp_test_indexes_path, lp_test_indexes)\n", "print(f\"LP test indexes are saved to {lp_test_indexes_path}\\n\")" ] }, { "cell_type": "markdown", "metadata": { "id": "wbk6-wxRK-6S" }, "source": [ "## Organize Data into YAML File\n", "Now we need to create a `metadata.yaml` file which contains the paths, dadta types of graph structure, feature data, training/validation/test sets.\n", "\n", "Notes:\n", "- all path should be relative to `metadata.yaml`.\n", "- Below fields are optional and not specified in below example.\n", " - `in_memory`: indicates whether to load dada into memory or `mmap`. Default is `True`.\n", "\n", "Please refer to [YAML specification](https://github.com/dmlc/dgl/blob/master/docs/source/stochastic_training/ondisk-dataset-specification.rst) for more details." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "ddGTWW61Lpwp" }, "outputs": [], "source": [ "yaml_content = f\"\"\"\n", " dataset_name: homogeneous_graph_nc_lp\n", " graph:\n", " nodes:\n", " - num: {num_nodes}\n", " edges:\n", " - format: csv\n", " path: {os.path.basename(edges_path)}\n", " feature_data:\n", " - domain: node\n", " name: feat_0\n", " format: numpy\n", " path: {os.path.basename(node_feat_0_path)}\n", " - domain: node\n", " name: feat_1\n", " format: torch\n", " path: {os.path.basename(node_feat_1_path)}\n", " - domain: edge\n", " name: feat_0\n", " format: numpy\n", " path: {os.path.basename(edge_feat_0_path)}\n", " - domain: edge\n", " name: feat_1\n", " format: torch\n", " path: {os.path.basename(edge_feat_1_path)}\n", " tasks:\n", " - name: node_classification\n", " num_classes: 10\n", " train_set:\n", " - data:\n", " - name: seeds\n", " format: numpy\n", " path: {os.path.basename(nc_train_ids_path)}\n", " - name: labels\n", " format: torch\n", " path: {os.path.basename(nc_train_labels_path)}\n", " validation_set:\n", " - data:\n", " - name: seeds\n", " format: numpy\n", " path: {os.path.basename(nc_val_ids_path)}\n", " - name: labels\n", " format: torch\n", " path: {os.path.basename(nc_val_labels_path)}\n", " test_set:\n", " - data:\n", " - name: seeds\n", " format: numpy\n", " path: {os.path.basename(nc_test_ids_path)}\n", " - name: labels\n", " format: torch\n", " path: {os.path.basename(nc_test_labels_path)}\n", " - name: link_prediction\n", " num_classes: 10\n", " train_set:\n", " - data:\n", " - name: seeds\n", " format: numpy\n", " path: {os.path.basename(lp_train_seeds_path)}\n", " validation_set:\n", " - data:\n", " - name: seeds\n", " format: numpy\n", " path: {os.path.basename(lp_val_seeds_path)}\n", " - name: labels\n", " format: numpy\n", " path: {os.path.basename(lp_val_labels_path)}\n", " - name: indexes\n", " format: numpy\n", " path: {os.path.basename(lp_val_indexes_path)}\n", " test_set:\n", " - data:\n", " - name: seeds\n", " format: numpy\n", " path: {os.path.basename(lp_test_seeds_path)}\n", " - name: labels\n", " format: numpy\n", " path: {os.path.basename(lp_test_labels_path)}\n", " - name: indexes\n", " format: numpy\n", " path: {os.path.basename(lp_test_indexes_path)}\n", "\"\"\"\n", "metadata_path = os.path.join(base_dir, \"metadata.yaml\")\n", "with open(metadata_path, \"w\") as f:\n", " f.write(yaml_content)" ] }, { "cell_type": "markdown", "metadata": { "id": "kEfybHGhOW7O" }, "source": [ "## Instantiate `OnDiskDataset`\n", "Now we're ready to load dataset via `dgl.graphbolt.OnDiskDataset`. When instantiating, we just pass in the base directory where `metadata.yaml` file lies.\n", "\n", "During first instantiation, GraphBolt preprocesses the raw data such as constructing `FusedCSCSamplingGraph` from edges. All data including graph, feature data, training/validation/test sets are put into `preprocessed` directory after preprocessing. Any following dataset loading will skip the preprocess stage.\n", "\n", "After preprocessing, `load()` is required to be called explicitly in order to load graph, feature data and tasks." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "W58CZoSzOiyo" }, "outputs": [], "source": [ "dataset = gb.OnDiskDataset(base_dir).load()\n", "graph = dataset.graph\n", "print(f\"Loaded graph: {graph}\\n\")\n", "\n", "feature = dataset.feature\n", "print(f\"Loaded feature store: {feature}\\n\")\n", "\n", "tasks = dataset.tasks\n", "nc_task = tasks[0]\n", "print(f\"Loaded node classification task: {nc_task}\\n\")\n", "lp_task = tasks[1]\n", "print(f\"Loaded link prediction task: {lp_task}\\n\")" ] } ], "metadata": { "colab": { "private_outputs": true, "provenance": [] }, "kernelspec": { "display_name": "Python 3", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.10.12" } }, "nbformat": 4, "nbformat_minor": 0 }