[doc] update OnDiskDataset (#6811)

Commit 3e59b1d4 authored Dec 22, 2023 by Rhett Ying; committed by GitHub Dec 22, 2023
4 changed files
--- a/docs/source/stochastic_training/ondisk-dataset.rst
+++ b/docs/source/stochastic_training/ondisk-dataset.rst
 .. _stochastic_training-ondisk-dataset:

-Creating OnDiskDataset
-======================
+Composing OnDiskDataset from raw data
+=====================================

-This tutorial shows how to create an `OnDiskDataset` from raw data and use it
-for stochastic training.
+This tutorial shows how to compose :class:`~dgl.graphbolt.OnDiskDataset` from
+raw data. A full specification of ``metadata.yaml`` is also provided.

 **GraphBolt** provides the ``OnDiskDataset`` class to help users organize plain
 data of graph structure, feature data and tasks. ``OnDiskDataset`` is also
 designed to efficiently handle large graphs and features that do not fit into
 memory by storing them on disk.

-For more details about `OnDiskDataset`, please refer to the
-:class:`~dgl.graphbolt.OnDiskDataset` API documentation.
-
 .. toctree::
    :maxdepth: 1
    :glob:

--- a/notebooks/stochastic_training/ondisk_dataset_heterograph.ipynb
+++ b/notebooks/stochastic_training/ondisk_dataset_heterograph.ipynb
@@ -25,6 +25,7 @@
        "This tutorial shows how to create an `OnDiskDataset` for a heterogeneous graph that can be used in the **GraphBolt** framework. The major difference from creating a dataset for a homogeneous graph is that we need to specify node/edge types for edges, feature data, and training/validation/test sets.\n",
        "\n",
        "By the end of this tutorial, you will be able to\n",
+        "\n",
        "- organize graph structure data.\n",
        "- organize feature data.\n",
        "- organize training/validation/test set for specific tasks.\n",
@@ -104,7 +105,7 @@
        "### Generate graph structure data\n",
        "For a heterogeneous graph, we need to save the edges (namely node pairs) of each edge type into separate **CSV** files.\n",
        "\n",
-        "Note:\n",
+        "**Note**:\n",
        "When saving to file, do not save the index and header.\n"
      ],
      "metadata": {

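Reviewer note: the "no index, no header" rule in the hunk above can be sketched in Python as follows. This is a minimal illustration using only the standard library, not the notebook's own code; the edge lists, the `save_edges` helper and the temporary directory are hypothetical. With pandas, the equivalent is `DataFrame.to_csv(path, index=False, header=False)`.

```python
import csv
import os
import tempfile


def save_edges(path, node_pairs):
    """Write (src, dst) node pairs as bare CSV rows: no header, no index column."""
    with open(path, "w", newline="") as f:
        csv.writer(f, lineterminator="\n").writerows(node_pairs)


# Hypothetical edge lists, keyed by edge type, for illustration only.
edges = {
    "author:writes:paper": [(0, 1), (1, 2), (2, 0)],
    "paper:cites:paper": [(0, 2), (1, 0)],
}

# A throwaway dataset directory; a real dataset would use a persistent one.
base_dir = tempfile.mkdtemp()
os.makedirs(os.path.join(base_dir, "edge_data"), exist_ok=True)

written = {}
for etype, pairs in edges.items():
    # One file per edge type, named like the paths in the YAML example,
    # e.g. "author:writes:paper" -> "edge_data/author-writes-paper.csv".
    rel_path = os.path.join("edge_data", etype.replace(":", "-") + ".csv")
    save_edges(os.path.join(base_dir, rel_path), pairs)
    written[etype] = rel_path
```

Each file then holds one `src,dst` pair per line and nothing else, and `written` maps each edge type to the relative path that would go into the `path` field of ``metadata.yaml``.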
--- a/notebooks/stochastic_training/ondisk_dataset_homograph.ipynb
+++ b/notebooks/stochastic_training/ondisk_dataset_homograph.ipynb
@@ -25,6 +25,7 @@
        "This tutorial shows how to create an `OnDiskDataset` for a homogeneous graph that can be used in the **GraphBolt** framework.\n",
        "\n",
        "By the end of this tutorial, you will be able to\n",
+        "\n",
        "- organize graph structure data.\n",
        "- organize feature data.\n",
        "- organize training/validation/test set for specific tasks.\n",

--- a/python/dgl/graphbolt/impl/ondisk_dataset.py
+++ b/python/dgl/graphbolt/impl/ondisk_dataset.py
@@ -281,22 +281,33 @@ class OnDiskDataset(Dataset):
    Due to limited resources, the data which are too large to fit into RAM will
    remain on disk while others reside in RAM once ``OnDiskDataset`` is
    initialized. This behavior can be controlled by the user via the ``in_memory``
-    field in YAML file.
+    field in YAML file. All paths in the YAML file are relative to the
+    dataset directory.

    A full example of YAML file is as follows:

    .. code-block:: yaml

        dataset_name: graphbolt_test
-        graph_topology:
-          type: FusedCSCSamplingGraph
-          path: graph_topology/fused_csc_sampling_graph.tar
+        graph:
+          nodes:
+            - type: paper # could be omitted for homogeneous graph.
+              num: 1000
+            - type: author
+              num: 1000
+          edges:
+            - type: author:writes:paper # could be omitted for homogeneous graph.
+              format: csv # Can be csv only.
+              path: edge_data/author-writes-paper.csv
+            - type: paper:cites:paper
+              format: csv
+              path: edge_data/paper-cites-paper.csv
        feature_data:
          - domain: node
-            type: paper
+            type: paper # could be omitted for homogeneous graph.
            name: feat
            format: numpy
-            in_memory: false
+            in_memory: false # If not specified, default to true.
            path: node_data/paper-feat.npy
          - domain: edge
            type: "author:writes:paper"
@@ -308,37 +319,35 @@ class OnDiskDataset(Dataset):
          - name: "edge_classification"
            num_classes: 10
            train_set:
-              - type: paper # could be null for homogeneous graph.
-                data: # multiple data sources could be specified.
-                  - name: node_pairs
-                    format: numpy
+              - data: # multiple data sources could be specified.
+                  - type: paper
+                    name: node_pairs
+                    format: numpy # Can be numpy or torch.
                    in_memory: true # If not specified, default to true.
                    path: set/paper-train-node_pairs.npy
-                  - name: labels
+                  - type: paper
+                    name: labels
                    format: numpy
-                    in_memory: false
                    path: set/paper-train-labels.npy
            validation_set:
+              - data:
                  - type: paper
-                data:
-                  - name: node_pairs
+                    name: node_pairs
                    format: numpy
-                    in_memory: true
                    path: set/paper-validation-node_pairs.npy
-                  - name: labels
+                  - type: paper
+                    name: labels
                    format: numpy
-                    in_memory: true
                    path: set/paper-validation-labels.npy
            test_set:
+              - data:
                  - type: paper
-                data:
-                  - name: node_pairs
+                    name: node_pairs
                    format: numpy
-                    in_memory: true
                    path: set/paper-test-node_pairs.npy
-                  - name: labels
+                  - type: paper
+                    name: labels
                    format: numpy
-                    in_memory: true
                    path: set/paper-test-labels.npy

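Reviewer note: the new sentence about relative paths can be illustrated with a small sketch. The code below is not GraphBolt's implementation; it only shows, under that stated rule, how every ``path`` entry in a metadata mapping might be resolved against a dataset directory. The helper names, the trimmed-down metadata dict and the directory `/data/graphbolt_test` are hypothetical.

```python
from pathlib import Path


def collect_paths(node):
    """Recursively yield every string stored under a ``path`` key."""
    if isinstance(node, dict):
        for key, value in node.items():
            if key == "path" and isinstance(value, str):
                yield value
            else:
                yield from collect_paths(value)
    elif isinstance(node, list):
        for item in node:
            yield from collect_paths(item)


def resolve_paths(metadata, dataset_dir):
    """Map each relative ``path`` entry to its location under dataset_dir."""
    base = Path(dataset_dir)
    resolved = {}
    for rel in collect_paths(metadata):
        if Path(rel).is_absolute():
            # The docstring states that paths are relative to the dataset directory.
            raise ValueError(f"expected a relative path, got {rel!r}")
        resolved[rel] = base / rel
    return resolved


# A trimmed-down stand-in for the parsed YAML example (hypothetical subset).
metadata = {
    "dataset_name": "graphbolt_test",
    "graph": {
        "edges": [
            {
                "type": "paper:cites:paper",
                "format": "csv",
                "path": "edge_data/paper-cites-paper.csv",
            }
        ]
    },
    "feature_data": [
        {
            "domain": "node",
            "type": "paper",
            "name": "feat",
            "format": "numpy",
            "in_memory": False,
            "path": "node_data/paper-feat.npy",
        }
    ],
}

# "/data/graphbolt_test" is a made-up dataset directory.
resolved = resolve_paths(metadata, "/data/graphbolt_test")
```

A check like this could be run before loading a dataset to catch absolute or misplaced paths early.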
    Parameters