"By the end of this tutorial, you will be able to\n",
"- organize graph structure data.\n",
"- organize feature data.\n",
"- organize training/validation/test set for specific tasks."
"- organize training/validation/test set for specific tasks.\n",
"\n",
"To create an ``OnDiskDataset`` object, you need to organize all the data including graph structure, feature data and tasks into a directory. The directory should contain a ``metadata.yaml`` file that describes the metadata of the dataset.\n",
"\n",
"Now let's generate various data step by step and organize them together to instantiate `OnDiskDataset` finally."
],
"metadata": {
"id": "FnFhPMaAfLtJ"
...
...
@@ -71,6 +74,387 @@
},
"execution_count": null,
"outputs": []
},
{
"cell_type": "markdown",
"source": [
"## Data preparation\n",
"In order to demonstrate how to organize various data, let's create a base directory first."
],
"metadata": {
"id": "2R7WnSbjsfbr"
}
},
{
"cell_type": "code",
"source": [
"base_dir = './ondisk_dataset_heterograph'\n",
"os.makedirs(base_dir, exist_ok=True)\n",
"print(f\"Created base directory: {base_dir}\")"
],
"metadata": {
"id": "SZipbzyltLfO"
},
"execution_count": null,
"outputs": []
},
{
"cell_type": "markdown",
"source": [
"### Generate graph structure data\n",
"For heterogeneous graph, we just need to save edges(namely node pairs) into **CSV** file.\n",
"\n",
"Note:\n",
"when saving to file, do not save index and header.*italicized text*\n"
"print(f\"Part of edge feature [feat_1]: {edge_feat_1[:10, :]}\")\n",
"torch.save(edge_feat_1, edge_feat_1_path)\n",
"print(f\"Edge feature [feat_1] is saved to {edge_feat_1_path}\")\n"
],
"metadata": {
"id": "_PVu1u5brBhF"
},
"execution_count": null,
"outputs": []
},
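{
"cell_type": "markdown",
"source": [
"For reference, here is a minimal sketch of the graph-structure step described above: it writes randomly generated node pairs for a hypothetical `user:like:item` edge type into a CSV file without index and header. The edge type name and sizes are illustrative assumptions, not part of the tutorial's data."
],
"metadata": {}
},
{
"cell_type": "code",
"source": [
"import os\n",
"\n",
"import numpy as np\n",
"import pandas as pd\n",
"\n",
"# Illustrative sizes for a hypothetical \"user:like:item\" edge type.\n",
"num_users, num_items, num_like_edges = 1000, 500, 10000\n",
"\n",
"like_edges_path = os.path.join(base_dir, \"like_edges.csv\")\n",
"like_src = np.random.randint(0, num_users, size=num_like_edges)\n",
"like_dst = np.random.randint(0, num_items, size=num_like_edges)\n",
"\n",
"# Save node pairs as CSV without index and header, as noted above.\n",
"pd.DataFrame({\"src\": like_src, \"dst\": like_dst}).to_csv(\n",
"    like_edges_path, index=False, header=False\n",
")\n",
"print(f\"Edges of [user:like:item] are saved to {like_edges_path}\")"
],
"metadata": {},
"execution_count": null,
"outputs": []
},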
{
"cell_type": "markdown",
"source": [
"### Generate tasks\n",
"`OnDiskDataset` supports multiple tasks. For each task, we need to prepare training/validation/test sets respectively. Such sets usually vary among different tasks. In this tutorial, let's create a **Node Classification** task and **Link Prediction** task."
],
"metadata": {
"id": "ZyqgOtsIwzh_"
}
},
{
"cell_type": "markdown",
"source": [
"#### Node Classification Task\n",
"For node classification task, we need **node IDs** and corresponding **labels** for each training/validation/test set. Like feature data, numpy arrays and torch tensors are supported for these sets."
"print(f\"NC test labels are saved to {nc_test_labels_path}\")"
],
"metadata": {
"id": "S5-fyBbHzTCO"
},
"execution_count": null,
"outputs": []
},
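{
"cell_type": "markdown",
"source": [
"For reference, a compact sketch of the pattern used above for one split of the node classification sets; the node count, number of classes, and split size are assumptions for illustration only."
],
"metadata": {}
},
{
"cell_type": "code",
"source": [
"import os\n",
"\n",
"import torch\n",
"\n",
"# Assumed sizes: 1000 nodes, 10 classes, 600 training nodes.\n",
"num_users, num_classes, num_trains = 1000, 10, 600\n",
"\n",
"# Random node IDs and labels for the training split; validation/test follow the same pattern.\n",
"nc_train_ids = torch.randperm(num_users)[:num_trains]\n",
"nc_train_labels = torch.randint(0, num_classes, (num_trains,))\n",
"\n",
"nc_train_ids_path = os.path.join(base_dir, \"nc_train_ids.pt\")\n",
"nc_train_labels_path = os.path.join(base_dir, \"nc_train_labels.pt\")\n",
"torch.save(nc_train_ids, nc_train_ids_path)\n",
"torch.save(nc_train_labels, nc_train_labels_path)\n",
"print(f\"NC train IDs are saved to {nc_train_ids_path}\")\n",
"print(f\"NC train labels are saved to {nc_train_labels_path}\")"
],
"metadata": {},
"execution_count": null,
"outputs": []
},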
{
"cell_type": "markdown",
"source": [
"#### Link Prediction Task\n",
"For link prediction task, we need **node pairs** or **negative src/dsts** for each training/validation/test set. Like feature data, numpy arrays and torch tensors are supported for these sets."
"print(f\"LP test negative dsts are saved to {lp_test_neg_dsts_path}\")"
],
"metadata": {
"id": "u0jCnXIcAQy4"
},
"execution_count": null,
"outputs": []
},
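{
"cell_type": "markdown",
"source": [
"For reference, a compact sketch of the pattern used above for the link prediction test set; the numbers of pairs, nodes, and negative destinations per pair are assumptions for illustration only."
],
"metadata": {}
},
{
"cell_type": "code",
"source": [
"import os\n",
"\n",
"import torch\n",
"\n",
"# Assumed sizes: 1000 source nodes, 500 destination nodes,\n",
"# 2000 test pairs, 10 negative destinations per pair.\n",
"num_users, num_items = 1000, 500\n",
"num_test_pairs, num_negs = 2000, 10\n",
"\n",
"# Positive node pairs of shape (num_test_pairs, 2) and negative destinations of shape (num_test_pairs, num_negs).\n",
"lp_test_pairs = torch.stack(\n",
"    (\n",
"        torch.randint(0, num_users, (num_test_pairs,)),\n",
"        torch.randint(0, num_items, (num_test_pairs,)),\n",
"    ),\n",
"    dim=1,\n",
")\n",
"lp_test_neg_dsts = torch.randint(0, num_items, (num_test_pairs, num_negs))\n",
"\n",
"lp_test_pairs_path = os.path.join(base_dir, \"lp_test_pairs.pt\")\n",
"lp_test_neg_dsts_path = os.path.join(base_dir, \"lp_test_neg_dsts.pt\")\n",
"torch.save(lp_test_pairs, lp_test_pairs_path)\n",
"torch.save(lp_test_neg_dsts, lp_test_neg_dsts_path)\n",
"print(f\"LP test pairs are saved to {lp_test_pairs_path}\")\n",
"print(f\"LP test negative dsts are saved to {lp_test_neg_dsts_path}\")"
],
"metadata": {},
"execution_count": null,
"outputs": []
},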
{
"cell_type": "markdown",
"source": [
"## Organize Data into YAML File\n",
"Now we need to create a `metadata.yaml` file which contains the paths, dadta types of graph structure, feature data, training/validation/test sets. Please note that all path should be relative to `metadata.yaml`."
"Now we're ready to load dataset via `dgl.graphbolt.OnDiskDataset`. When instantiating, we just pass in the base directory where `metadata.yaml` file lies.\n",
"\n",
"During first instantiation, GraphBolt preprocesses the raw data such as constructing `FusedCSCSamplingGraph` from edges. All data including graph, feature data, training/validation/test sets are put into `preprocessed` directory after preprocessing. Any following dataset loading will skip the preprocess stage.\n",
"\n",
"After preprocessing, `load()` is required to be called explicitly in order to load graph, feature data and tasks."