[GraphBolt] add notebooks for OnDiskDataset (#6771)

Commit 358db43a authored Dec 18, 2023 by Rhett Ying, committed by GitHub Dec 18, 2023
7 changed files
--- a/docs/source/stochastic_training/index.rst
+++ b/docs/source/stochastic_training/index.rst
@@ -9,9 +9,8 @@ GraphBolt.
 .. toctree::
  :maxdepth: 1
-  :titlesonly:
  neighbor_sampling_overview.nblink
  node_classification.nblink
  link_prediction.nblink
-  ondisk-dataset
+  ondisk-dataset.rst
--- a/docs/source/stochastic_training/ondisk-dataset-specification.rst
+++ b/docs/source/stochastic_training/ondisk-dataset-specification.rst
+.. _stochastic_training-ondisk-dataset-specification:
+Prepare dataset
+===============
+**GraphBolt** provides the ``OnDiskDataset`` class to help users organize plain
+data of graph structure, feature data and tasks. ``OnDiskDataset`` is also
+designed to efficiently handle large graphs and features that do not fit into
+memory by storing them on disk.
+To create an ``OnDiskDataset`` object, you need to organize all the data including
+graph structure, feature data and tasks into a directory. The directory should
+contain a ``metadata.yaml`` file that describes the metadata of the dataset.
+Then just pass the directory path to the ``OnDiskDataset`` constructor to create
+the dataset object.
+.. code:: python
+    from dgl.graphbolt import OnDiskDataset
+    dataset = OnDiskDataset('/path/to/dataset')
+The returned ``dataset`` object just loads the yaml file and does not load any
+data. To load the graph structure, feature data and tasks, you need to call
+the ``load`` method.
+.. code:: python
+    dataset.load()
+``OnDiskDataset`` separates object creation from data loading because you may
+want to change some fields of the ``metadata.yaml`` file before loading the
+data. For example, you may want to change the path of the feature data files
+to point to a different directory. In this case, modify the path via
+``dataset.yaml_data`` directly, then call the ``load`` method to load the
+data.
+After loading the data, you can access the graph structure, feature data and
+tasks through the ``graph``, ``feature`` and ``tasks`` attributes respectively.
+.. code:: python
+    graph = dataset.graph
+    feature = dataset.feature
+    tasks = dataset.tasks
+The returned ``graph`` is a ``FusedCSCSamplingGraph`` object, which will be used
+for sampling. The returned ``feature`` is a ``TorchBasedFeatureStore`` object,
+which will be used for feature lookup. The returned ``tasks`` is a list of
+``Task`` objects, which will be used for training and evaluation.
+The following examples show data folder structure and ``metadata.yaml`` file for
+homogeneous graphs and heterogeneous graphs respectively. If you want to know
+the full YAML specification, please refer to the `Full YAML specification`_ section.
+Homogeneous graph
+-----------------
+Data folder structure:
+^^^^^^^^^^^^^^^^^^^^^^
+.. code::
+    data/
+      node_feat.npy
+      edge_feat.npy
+    edges/
+      edges.csv
+    set_nc/
+      train_seed_nodes.npy
+      train_labels.npy
+      val_seed_nodes.npy
+      val_labels.npy
+      test_seed_nodes.npy
+      test_labels.npy
+    set_lp/
+      train_node_pairs.npy
+      val_node_pairs.npy
+      val_negative_dsts.npy
+      test_node_pairs.npy
+      test_negative_dsts.npy
+    metadata.yaml
+``metadata.yaml`` file:
+^^^^^^^^^^^^^^^^^^^^^^^
+.. code:: yaml
+    dataset_name: homogeneous_graph_nc_lp
+    graph:
+      nodes:
+        - num: 10
+      edges:
+        - format: csv
+          path: edges/edges.csv
+    feature_data:
+      - domain: node
+        name: feat
+        format: numpy
+        in_memory: true
+        path: data/node_feat.npy
+      - domain: edge
+        name: feat
+        format: numpy
+        in_memory: true
+        path: data/edge_feat.npy
+    tasks:
+      - name: node_classification
+        num_classes: 2
+        train_set:
+          - data:
+              - name: seed_nodes
+                format: numpy
+                in_memory: true
+                path: set_nc/train_seed_nodes.npy
+              - name: labels
+                format: numpy
+                in_memory: true
+                path: set_nc/train_labels.npy
+        validation_set:
+          - data:
+              - name: seed_nodes
+                format: numpy
+                in_memory: true
+                path: set_nc/val_seed_nodes.npy
+              - name: labels
+                format: numpy
+                in_memory: true
+                path: set_nc/val_labels.npy
+        test_set:
+          - data:
+              - name: seed_nodes
+                format: numpy
+                in_memory: true
+                path: set_nc/test_seed_nodes.npy
+              - name: labels
+                format: numpy
+                in_memory: true
+                path: set_nc/test_labels.npy
+      - name: link_prediction
+        num_classes: 2
+        train_set:
+          - data:
+              - name: node_pairs
+                format: numpy
+                in_memory: true
+                path: set_lp/train_node_pairs.npy
+        validation_set:
+          - data:
+              - name: node_pairs
+                format: numpy
+                in_memory: true
+                path: set_lp/val_node_pairs.npy
+              - name: negative_dsts
+                format: numpy
+                in_memory: true
+                path: set_lp/val_negative_dsts.npy
+        test_set:
+          - data:
+              - name: node_pairs
+                format: numpy
+                in_memory: true
+                path: set_lp/test_node_pairs.npy
+              - name: negative_dsts
+                format: numpy
+                in_memory: true
+                path: set_lp/test_negative_dsts.npy
+For the graph structure, the number of nodes is specified by the ``num`` field
+and the edges are stored in a CSV file in ``<src, dst>`` format, as shown below.
+.. code:: csv
+    edges.csv
+    0,1
+    1,2
+    2,3
+    3,4
+    4,5
+    5,6
+    6,7
+    7,8
+    8,9
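The edge file described above can be produced and checked with the Python standard library alone. This is a minimal sketch, not part of GraphBolt: the helpers `write_edges` and `read_edges` are hypothetical names, the file is assumed to have no header row, and the node count of 10 mirrors the `num` field in the example metadata.

```python
import csv
import tempfile
from pathlib import Path

NUM_NODES = 10  # mirrors ``num: 10`` in the example metadata.yaml


def write_edges(path, edges):
    """Write (src, dst) pairs as a two-column CSV file (assumed headerless)."""
    path.parent.mkdir(parents=True, exist_ok=True)
    with path.open("w", newline="") as f:
        csv.writer(f).writerows(edges)


def read_edges(path):
    """Read a two-column CSV file back into a list of (src, dst) int pairs."""
    with path.open(newline="") as f:
        return [(int(src), int(dst)) for src, dst in csv.reader(f)]


# A chain 0 -> 1 -> ... -> 9, matching the listing above.
edges = [(i, i + 1) for i in range(NUM_NODES - 1)]

with tempfile.TemporaryDirectory() as root:
    edge_file = Path(root) / "edges" / "edges.csv"
    write_edges(edge_file, edges)
    loaded = read_edges(edge_file)

# Every endpoint must be a valid node ID in [0, NUM_NODES).
all_valid = all(0 <= s < NUM_NODES and 0 <= d < NUM_NODES for s, d in loaded)
```

Checking node IDs against `num` before building the dataset catches off-by-one mistakes early, when they are still cheap to fix.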
+For the feature data, both the node features and the edge features are named
+``feat``. They are stored in numpy files of shape ``[num_nodes, 10]`` and
+``[num_edges, 10]`` respectively, as shown below.
+.. code:: python
+    node_feat.npy
+    array([[0., 0., 0., 0., 0., 0., 0., 0., 0., 0.],
+           [1., 1., 1., 1., 1., 1., 1., 1., 1., 1.],
+           [2., 2., 2., 2., 2., 2., 2., 2., 2., 2.],
+           [3., 3., 3., 3., 3., 3., 3., 3., 3., 3.],
+           [4., 4., 4., 4., 4., 4., 4., 4., 4., 4.],
+           [5., 5., 5., 5., 5., 5., 5., 5., 5., 5.],
+           [6., 6., 6., 6., 6., 6., 6., 6., 6., 6.],
+           [7., 7., 7., 7., 7., 7., 7., 7., 7., 7.],
+           [8., 8., 8., 8., 8., 8., 8., 8., 8., 8.],
+           [9., 9., 9., 9., 9., 9., 9., 9., 9., 9.]])
+    edge_feat.npy
+    array([[0., 0., 0., 0., 0., 0., 0., 0., 0., 0.],
+           [1., 1., 1., 1., 1., 1., 1., 1., 1., 1.],
+           [2., 2., 2., 2., 2., 2., 2., 2., 2., 2.],
+           [3., 3., 3., 3., 3., 3., 3., 3., 3., 3.],
+           [4., 4., 4., 4., 4., 4., 4., 4., 4., 4.],
+           [5., 5., 5., 5., 5., 5., 5., 5., 5., 5.],
+           [6., 6., 6., 6., 6., 6., 6., 6., 6., 6.],
+           [7., 7., 7., 7., 7., 7., 7., 7., 7., 7.],
+           [8., 8., 8., 8., 8., 8., 8., 8., 8., 8.],
+           [9., 9., 9., 9., 9., 9., 9., 9., 9., 9.]])
+For the ``tasks`` field, we have two tasks: ``node_classification`` and
+``link_prediction``. For each task, we have three sets: ``train_set``,
+``validation_set`` and ``test_set``.
+For ``node_classification`` task, we have two fields: ``seed_nodes`` and
+``labels``. The ``seed_nodes`` field is used to specify the node IDs for
+training and evaluation. The ``labels`` field is used to specify the
+labels. Both of them are stored in numpy files with shape of ``[num_nodes]``
+like below.
+.. code:: python
+    train_seed_nodes.npy
+    array([0, 1, 2, 3, 4, 5])
+    train_labels.npy
+    array([0, 1, 0, 1, 0, 1])
+    val_seed_nodes.npy
+    array([6, 7])
+    val_labels.npy
+    array([0, 1])
+    test_seed_nodes.npy
+    array([8, 9])
+    test_labels.npy
+    array([0, 1])
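A quick sanity check on the node classification splits can be sketched in plain Python. The function `check_splits` is a hypothetical helper, not a GraphBolt API, and the requirement that splits do not overlap is an assumption made here to avoid evaluation leakage, not something the specification states. Plain lists stand in for the `.npy` arrays.

```python
NUM_NODES = 10
NUM_CLASSES = 2

# (seed_nodes, labels) per split, copied from the listing above.
splits = {
    "train": ([0, 1, 2, 3, 4, 5], [0, 1, 0, 1, 0, 1]),
    "val": ([6, 7], [0, 1]),
    "test": ([8, 9], [0, 1]),
}


def check_splits(splits, num_nodes, num_classes):
    """Validate the splits and return the total number of distinct seed nodes."""
    seen = set()
    for name, (seeds, labels) in splits.items():
        if len(seeds) != len(labels):
            raise ValueError(f"{name}: seed_nodes and labels differ in length")
        if not all(0 <= n < num_nodes for n in seeds):
            raise ValueError(f"{name}: node ID out of range")
        if not all(0 <= y < num_classes for y in labels):
            raise ValueError(f"{name}: label out of range")
        if seen & set(seeds):
            raise ValueError(f"{name}: overlaps an earlier split")
        seen |= set(seeds)
    return len(seen)


total_seeds = check_splits(splits, NUM_NODES, NUM_CLASSES)
```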
+For the ``link_prediction`` task, we have two fields: ``node_pairs`` and
+``negative_dsts``. The ``node_pairs`` field is used to specify the node pairs.
+The ``negative_dsts`` field is used to specify the negative destination nodes.
+They are stored in numpy files of shape ``[num_edges, 2]`` and
+``[num_edges, num_neg_dsts]`` respectively, as shown below.
+.. code:: python
+    train_node_pairs.npy
+    array([[0, 1],
+           [1, 2],
+           [2, 3],
+           [3, 4],
+           [4, 5],
+           [5, 6]])
+    val_node_pairs.npy
+    array([[6, 7],
+           [7, 8]])
+    val_negative_dsts.npy
+    array([[8, 9],
+           [8, 9]])
+    test_node_pairs.npy
+    array([[8, 9],
+           [9, 0]])
+    test_negative_dsts.npy
+    array([[0, 1],
+           [0, 1]])
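The shape relationship between ``node_pairs`` and ``negative_dsts`` can be checked with a short sketch. The helper `shape_2d` is hypothetical and works on nested lists, which stand in for the `.npy` arrays, so that the example needs nothing beyond the standard library.

```python
def shape_2d(rows):
    """Return (num_rows, num_cols) of a rectangular nested list."""
    widths = {len(row) for row in rows}
    if len(widths) != 1:
        raise ValueError("rows are empty or have inconsistent lengths")
    return len(rows), widths.pop()


# Validation set values copied from the listing above.
val_node_pairs = [[6, 7], [7, 8]]
val_negative_dsts = [[8, 9], [8, 9]]

pairs_shape = shape_2d(val_node_pairs)    # [num_edges, 2]
negs_shape = shape_2d(val_negative_dsts)  # [num_edges, num_neg_dsts]

# node_pairs needs exactly two columns (src, dst), and negative_dsts
# needs one row of negative destinations per node pair.
consistent = pairs_shape[1] == 2 and pairs_shape[0] == negs_shape[0]
```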
+.. note::
+    The values of the ``name`` fields in ``tasks``, such as ``seed_nodes``,
+    ``labels``, ``node_pairs`` and ``negative_dsts``, are mandatory. They are
+    used to specify the data fields of ``MiniBatch`` for sampling. The values
+    of the ``name`` fields in ``feature_data``, such as ``feat``, are user-defined.
+Heterogeneous graph
+-------------------
+Data folder structure:
+^^^^^^^^^^^^^^^^^^^^^^
+.. code::
+    data/
+      user_feat.npy
+      item_feat.npy
+      user_follow_user_feat.npy
+      user_click_item_feat.npy
+    edges/
+      user_follow_user.csv
+      user_click_item.csv
+    set_nc/
+      user_train_seed_nodes.npy
+      user_train_labels.npy
+      user_val_seed_nodes.npy
+      user_val_labels.npy
+      user_test_seed_nodes.npy
+      user_test_labels.npy
+    set_lp/
+      follow_train_node_pairs.npy
+      follow_val_node_pairs.npy
+      follow_val_negative_dsts.npy
+      follow_test_node_pairs.npy
+      follow_test_negative_dsts.npy
+    metadata.yaml
+``metadata.yaml`` file:
+^^^^^^^^^^^^^^^^^^^^^^^
+.. code:: yaml
+    dataset_name: heterogeneous_graph_nc_lp
+    graph:
+      nodes:
+        - type: user
+          num: 10
+        - type: item
+          num: 10
+      edges:
+        - type: "user:follow:user"
+          format: csv
+          path: edges/user_follow_user.csv
+        - type: "user:click:item"
+          format: csv
+          path: edges/user_click_item.csv
+    feature_data:
+      - domain: node
+        type: user
+        name: feat
+        format: numpy
+        in_memory: true
+        path: data/user_feat.npy
+      - domain: node
+        type: item
+        name: feat
+        format: numpy
+        in_memory: true
+        path: data/item_feat.npy
+      - domain: edge
+        type: "user:follow:user"
+        name: feat
+        format: numpy
+        in_memory: true
+        path: data/user_follow_user_feat.npy
+      - domain: edge
+        type: "user:click:item"
+        name: feat
+        format: numpy
+        in_memory: true
+        path: data/user_click_item_feat.npy
+    tasks:
+      - name: node_classification
+        num_classes: 2
+        train_set:
+          - type: user
+            data:
+              - name: seed_nodes
+                format: numpy
+                in_memory: true
+                path: set_nc/user_train_seed_nodes.npy
+              - name: labels
+                format: numpy
+                in_memory: true
+                path: set_nc/user_train_labels.npy
+        validation_set:
+          - type: user
+            data:
+              - name: seed_nodes
+                format: numpy
+                in_memory: true
+                path: set_nc/user_val_seed_nodes.npy
+              - name: labels
+                format: numpy
+                in_memory: true
+                path: set_nc/user_val_labels.npy
+        test_set:
+          - type: user
+            data:
+              - name: seed_nodes
+                format: numpy
+                in_memory: true
+                path: set_nc/user_test_seed_nodes.npy
+              - name: labels
+                format: numpy
+                in_memory: true
+                path: set_nc/user_test_labels.npy
+      - name: link_prediction
+        num_classes: 2
+        train_set:
+          - type: "user:follow:user"
+            data:
+              - name: node_pairs
+                format: numpy
+                in_memory: true
+                path: set_lp/follow_train_node_pairs.npy
+        validation_set:
+          - type: "user:follow:user"
+            data:
+              - name: node_pairs
+                format: numpy
+                in_memory: true
+                path: set_lp/follow_val_node_pairs.npy
+              - name: negative_dsts
+                format: numpy
+                in_memory: true
+                path: set_lp/follow_val_negative_dsts.npy
+        test_set:
+          - type: "user:follow:user"
+            data:
+              - name: node_pairs
+                format: numpy
+                in_memory: true
+                path: set_lp/follow_test_node_pairs.npy
+              - name: negative_dsts
+                format: numpy
+                in_memory: true
+                path: set_lp/follow_test_negative_dsts.npy
+For the graph structure, the example above has two node types: ``user`` and
+``item``. The number of nodes of each type is specified by the ``num`` field.
+There are two edge types: ``user:follow:user`` and ``user:click:item``.
+The edges of each type are stored as two columns in a CSV file, as shown below.
+.. code:: csv
+    user_follow_user.csv
+    0,1
+    1,2
+    2,3
+    3,4
+    4,5
+    5,6
+    6,7
+    7,8
+    8,9
+    user_click_item.csv
+    0,0
+    1,1
+    2,2
+    3,3
+    4,4
+    5,5
+    6,6
+    7,7
+    8,8
+    9,9
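The edge type strings above follow a ``source:relation:destination`` pattern, which a short sketch can split and use for validation. The helpers `parse_edge_type` and `count_valid_edges` are hypothetical names, not GraphBolt APIs. The sketch also assumes that the IDs in each CSV column are local to the corresponding node type, which the listing suggests but does not state.

```python
def parse_edge_type(etype):
    """Split 'src_type:relation:dst_type' into its three parts."""
    parts = etype.split(":")
    if len(parts) != 3 or not all(parts):
        raise ValueError(f"malformed edge type: {etype!r}")
    return tuple(parts)


# Node counts and edges copied from the example above.
num_nodes = {"user": 10, "item": 10}
edges = {
    "user:follow:user": [(i, i + 1) for i in range(9)],
    "user:click:item": [(i, i) for i in range(10)],
}


def count_valid_edges(edges, num_nodes):
    """Check every endpoint against its node type and return the edge count."""
    total = 0
    for etype, pairs in edges.items():
        src_type, _, dst_type = parse_edge_type(etype)
        for src, dst in pairs:
            if not 0 <= src < num_nodes[src_type]:
                raise ValueError(f"{etype}: source ID {src} out of range")
            if not 0 <= dst < num_nodes[dst_type]:
                raise ValueError(f"{etype}: destination ID {dst} out of range")
            total += 1
    return total


total_edges = count_valid_edges(edges, num_nodes)
```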
+For the feature data, the features of every node type and edge type are named
+``feat``. They are stored in numpy files of shape ``[num_nodes, 10]`` and
+``[num_edges, 10]`` respectively, as shown below.
+.. code:: python
+    user_feat.npy
+    array([[0., 0., 0., 0., 0., 0., 0., 0., 0., 0.],
+           [1., 1., 1., 1., 1., 1., 1., 1., 1., 1.],
+           [2., 2., 2., 2., 2., 2., 2., 2., 2., 2.],
+           [3., 3., 3., 3., 3., 3., 3., 3., 3., 3.],
+           [4., 4., 4., 4., 4., 4., 4., 4., 4., 4.],
+           [5., 5., 5., 5., 5., 5., 5., 5., 5., 5.],
+           [6., 6., 6., 6., 6., 6., 6., 6., 6., 6.],
+           [7., 7., 7., 7., 7., 7., 7., 7., 7., 7.],
+           [8., 8., 8., 8., 8., 8., 8., 8., 8., 8.],
+           [9., 9., 9., 9., 9., 9., 9., 9., 9., 9.]])
+    item_feat.npy
+    array([[0., 0., 0., 0., 0., 0., 0., 0., 0., 0.],
+           [1., 1., 1., 1., 1., 1., 1., 1., 1., 1.],
+           [2., 2., 2., 2., 2., 2., 2., 2., 2., 2.],
+           [3., 3., 3., 3., 3., 3., 3., 3., 3., 3.],
+           [4., 4., 4., 4., 4., 4., 4., 4., 4., 4.],
+           [5., 5., 5., 5., 5., 5., 5., 5., 5., 5.],
+           [6., 6., 6., 6., 6., 6., 6., 6., 6., 6.],
+           [7., 7., 7., 7., 7., 7., 7., 7., 7., 7.],
+           [8., 8., 8., 8., 8., 8., 8., 8., 8., 8.],
+           [9., 9., 9., 9., 9., 9., 9., 9., 9., 9.]])
+    user_follow_user_feat.npy
+    array([[0., 0., 0., 0., 0., 0., 0., 0., 0., 0.],
+           [1., 1., 1., 1., 1., 1., 1., 1., 1., 1.],
+           [2., 2., 2., 2., 2., 2., 2., 2., 2., 2.],
+           [3., 3., 3., 3., 3., 3., 3., 3., 3., 3.],
+           [4., 4., 4., 4., 4., 4., 4., 4., 4., 4.],
+           [5., 5., 5., 5., 5., 5., 5., 5., 5., 5.],
+           [6., 6., 6., 6., 6., 6., 6., 6., 6., 6.],
+           [7., 7., 7., 7., 7., 7., 7., 7., 7., 7.],
+           [8., 8., 8., 8., 8., 8., 8., 8., 8., 8.],
+           [9., 9., 9., 9., 9., 9., 9., 9., 9., 9.]])
+    user_click_item_feat.npy
+    array([[0., 0., 0., 0., 0., 0., 0., 0., 0., 0.],
+           [1., 1., 1., 1., 1., 1., 1., 1., 1., 1.],
+           [2., 2., 2., 2., 2., 2., 2., 2., 2., 2.],
+           [3., 3., 3., 3., 3., 3., 3., 3., 3., 3.],
+           [4., 4., 4., 4., 4., 4., 4., 4., 4., 4.],
+           [5., 5., 5., 5., 5., 5., 5., 5., 5., 5.],
+           [6., 6., 6., 6., 6., 6., 6., 6., 6., 6.],
+           [7., 7., 7., 7., 7., 7., 7., 7., 7., 7.],
+           [8., 8., 8., 8., 8., 8., 8., 8., 8., 8.],
+           [9., 9., 9., 9., 9., 9., 9., 9., 9., 9.]])
+For the ``tasks`` field, we have two tasks: ``node_classification`` and
+``link_prediction``. For each task, we have three sets: ``train_set``,
+``validation_set`` and ``test_set``.
+For ``node_classification`` task, we have two fields: ``seed_nodes`` and
+``labels``. The ``seed_nodes`` field is used to specify the node IDs for
+training and evaluation. The ``labels`` field is used to specify the
+labels. Both of them are stored in numpy files with shape of ``[num_nodes]``
+like below.
+.. code:: python
+    user_train_seed_nodes.npy
+    array([0, 1, 2, 3, 4, 5])
+    user_train_labels.npy
+    array([0, 1, 0, 1, 0, 1])
+    user_val_seed_nodes.npy
+    array([6, 7])
+    user_val_labels.npy
+    array([0, 1])
+    user_test_seed_nodes.npy
+    array([8, 9])
+    user_test_labels.npy
+    array([0, 1])
+For the ``link_prediction`` task, we have two fields: ``node_pairs`` and
+``negative_dsts``. The ``node_pairs`` field is used to specify the node pairs.
+The ``negative_dsts`` field is used to specify the negative destination nodes.
+They are stored in numpy files of shape ``[num_edges, 2]`` and
+``[num_edges, num_neg_dsts]`` respectively, as shown below.
+.. code:: python
+    follow_train_node_pairs.npy
+    array([[0, 1],
+           [1, 2],
+           [2, 3],
+           [3, 4],
+           [4, 5],
+           [5, 6]])
+    follow_val_node_pairs.npy
+    array([[6, 7],
+           [7, 8]])
+    follow_val_negative_dsts.npy
+    array([[8, 9],
+           [8, 9]])
+    follow_test_node_pairs.npy
+    array([[8, 9],
+           [9, 0]])
+    follow_test_negative_dsts.npy
+    array([[0, 1],
+           [0, 1]])
+Full YAML specification
+-----------------------
+The full YAML specification of ``metadata.yaml`` file is shown below.
+.. code:: yaml
+    dataset_name: <string>
+    graph:
+      nodes:
+        - type: <string>
+          num: <int>
+        - type: <string>
+          num: <int>
+      edges:
+        - type: <string>
+          format: <string>
+          path: <string>
+        - type: <string>
+          format: <string>
+          path: <string>
+    feature_data:
+      - domain: node
+        type: <string>
+        name: <string>
+        format: <string>
+        in_memory: <bool>
+        path: <string>
+      - domain: node
+        type: <string>
+        name: <string>
+        format: <string>
+        in_memory: <bool>
+        path: <string>
+      - domain: edge
+        type: <string>
+        name: <string>
+        format: <string>
+        in_memory: <bool>
+        path: <string>
+      - domain: edge
+        type: <string>
+        name: <string>
+        format: <string>
+        in_memory: <bool>
+        path: <string>
+    tasks:
+      - name: <string>
+        num_classes: <int>
+        train_set:
+          - type: <string>
+            data:
+              - name: <string>
+                format: <string>
+                in_memory: <bool>
+                path: <string>
+              - name: <string>
+                format: <string>
+                in_memory: <bool>
+                path: <string>
+        validation_set:
+          - type: <string>
+            data:
+              - name: <string>
+                format: <string>
+                in_memory: <bool>
+                path: <string>
+              - name: <string>
+                format: <string>
+                in_memory: <bool>
+                path: <string>
+        test_set:
+          - type: <string>
+            data:
+              - name: <string>
+                format: <string>
+                in_memory: <bool>
+                path: <string>
+              - name: <string>
+                format: <string>
+                in_memory: <bool>
+                path: <string>
+``dataset_name``
+^^^^^^^^^^^^^^^^
+The ``dataset_name`` field is used to specify the name of the dataset. It is
+user-defined.
+``graph``
+^^^^^^^^^
+The ``graph`` field is used to specify the graph structure. It has two fields:
+``nodes`` and ``edges``.
+ - ``nodes``: ``list``
+   The ``nodes`` field is used to specify the number of nodes for each node type.
+   It is a list of ``node`` objects. Each ``node`` object has two fields: ``type``
+   and ``num``.
+    - ``type``: ``string``, optional
+      The ``type`` field is used to specify the node type. It is ``null`` for
+      homogeneous graphs. For heterogeneous graphs, it is the node type.
+    - ``num``: ``int``
+      The ``num`` field is used to specify the number of nodes for the node type.
+      It is mandatory for both homogeneous graphs and heterogeneous graphs.
+ - ``edges``: ``list``
+   The ``edges`` field is used to specify the edges. It is a list of ``edge``
+   objects. Each ``edge`` object has three fields: ``type``, ``format`` and
+   ``path``.
+    - ``type``: ``string``, optional
+      The ``type`` field is used to specify the edge type. It is ``null`` for
+      homogeneous graphs. For heterogeneous graphs, it is the edge type.
+    - ``format``: ``string``
+      The ``format`` field is used to specify the format of the edge data. It can
+      only be ``csv`` for now.
+    - ``path``: ``string``
+      The ``path`` field is used to specify the path of the edge data. It is
+      relative to the directory of ``metadata.yaml`` file.
+``feature_data``
+^^^^^^^^^^^^^^^^
+The ``feature_data`` field is used to specify the feature data. It is a list of
+``feature`` objects. Each ``feature`` object has six canonical fields: ``domain``,
+``type``, ``name``, ``format``, ``in_memory`` and ``path``. Any other fields will
+be passed to the ``Feature.metadata`` object.
+ - ``domain``: ``string``
+   The ``domain`` field is used to specify the domain of the feature data. It can
+   be either ``node`` or ``edge``.
+ - ``type``: ``string``, optional
+   The ``type`` field is used to specify the type of the feature data. It is
+   ``null`` for homogeneous graphs. For heterogeneous graphs, it is the node or
+   edge type.
+ - ``name``: ``string``
+   The ``name`` field is used to specify the name of the feature data. It is
+   user-defined.
+ - ``format``: ``string``
+   The ``format`` field is used to specify the format of the feature data. It can
+   be either ``numpy`` or ``torch``.
+ - ``in_memory``: ``bool``, optional
+   The ``in_memory`` field is used to specify whether the feature data is loaded
+   into memory. It can be either ``true`` or ``false``. Default is ``true``.
+ - ``path``: ``string``
+   The ``path`` field is used to specify the path of the feature data. It is
+   relative to the directory of ``metadata.yaml`` file.
+``tasks``
+^^^^^^^^^
+The ``tasks`` field is used to specify the tasks. It is a list of ``task``
+objects. Each ``task`` object has at least three fields: ``train_set``,
+``validation_set`` and ``test_set``. You are free to add other fields such as
+``num_classes``; all such extra fields are passed to the ``Task.metadata``
+object.
+ - ``name``: ``string``, optional
+   The ``name`` field is used to specify the name of the task. It is user-defined.
+ - ``num_classes``: ``int``, optional
+   The ``num_classes`` field is used to specify the number of classes of the task.
+ - ``train_set``: ``list``
+   The ``train_set`` field is used to specify the training set. It is a list of
+   ``set`` objects. Each ``set`` object has two fields: ``type`` and ``data``.
+    - ``type``: ``string``, optional
+      The ``type`` field is used to specify the node/edge type of the set. It is
+      ``null`` for homogeneous graphs. For heterogeneous graphs, it is the node
+      or edge type.
+    - ``data``: ``list``
+      The ``data`` field is used to load ``train_set``. It is a list of ``data``
+      objects. Each ``data`` object has four fields: ``name``, ``format``,
+      ``in_memory`` and ``path``.
+       - ``name``: ``string``
+         The ``name`` field is used to specify the name of the data. It is mandatory
+         and used to specify the data fields of ``MiniBatch`` for sampling. It can
+         be one of ``seed_nodes``, ``labels``, ``node_pairs``, ``negative_srcs`` or
+         ``negative_dsts``. If any other name is used, it will be added into the
+         ``MiniBatch`` data fields.
+       - ``format``: ``string``
+         The ``format`` field is used to specify the format of the data. It can be
+         either ``numpy`` or ``torch``.
+       - ``in_memory``: ``bool``, optional
+         The ``in_memory`` field is used to specify whether the data is loaded into
+         memory. It can be either ``true`` or ``false``. Default is ``true``.
+       - ``path``: ``string``
+         The ``path`` field is used to specify the path of the data. It is relative
+         to the directory of ``metadata.yaml`` file.
+ - ``validation_set``: ``list``
+ - ``test_set``: ``list``
+   The ``validation_set`` and ``test_set`` fields are used to specify the
+   validation set and test set respectively. They are similar to the
+   ``train_set`` field.
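Because every ``path`` in the specification is relative to the directory holding ``metadata.yaml``, a mismatch between the YAML and the folder layout is easy to detect before loading. The following is a rough sketch under stated assumptions: `missing_paths` is a hypothetical helper, and the regular expression only handles simple unquoted ``path: value`` lines. A real check would use a proper YAML parser.

```python
import re
import tempfile
from pathlib import Path

# Matches simple, unquoted "path: some/file" lines (a simplification).
PATH_RE = re.compile(r"^[ \t]*(?:-[ \t]*)?path:[ \t]*(\S+)[ \t]*$", re.MULTILINE)


def missing_paths(dataset_dir):
    """Return the paths named in metadata.yaml that do not exist on disk."""
    dataset_dir = Path(dataset_dir)
    text = (dataset_dir / "metadata.yaml").read_text()
    return [p for p in PATH_RE.findall(text) if not (dataset_dir / p).exists()]


# Tiny demonstration: one referenced file exists, the other does not.
with tempfile.TemporaryDirectory() as root:
    root = Path(root)
    (root / "metadata.yaml").write_text(
        "graph:\n"
        "  edges:\n"
        "    - format: csv\n"
        "      path: edges/edges.csv\n"
        "feature_data:\n"
        "  - domain: node\n"
        "    name: feat\n"
        "    format: numpy\n"
        "    path: data/node_feat.npy\n"
    )
    (root / "edges").mkdir()
    (root / "edges" / "edges.csv").write_text("0,1\n")
    missing = missing_paths(root)
```

In the demonstration only ``edges/edges.csv`` is created, so the feature file is the one reported as missing.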
--- a/docs/source/stochastic_training/ondisk-dataset.rst
+++ b/docs/source/stochastic_training/ondisk-dataset.rst
 .. _stochastic_training-ondisk-dataset:
-Prepare dataset
+Creating OnDiskDataset
-===============
+======================
-**GraphBolt** provides the ``OnDiskDataset`` class to help user organize plain
+This tutorial shows how to create an ``OnDiskDataset`` from raw data and use it
-data of graph structure, feature data and tasks. ``OnDiskDataset`` is also
+for stochastic training.
-designed to efficiently handle large graphs and features that do not fit into
-memory by storing them on disk.
-To create an ``OnDiskDataset`` object, you need to organize all the data including
+For more details about ``OnDiskDataset``, please refer to the
-graph structure, feature data and tasks into a directory. The directory should
+:class:`~dgl.graphbolt.OnDiskDataset` API documentation.
-contain a ``metadata.yaml`` file that describes the metadata of the dataset.
-Then just pass the directory path to the ``OnDiskDataset`` constructor to create
+.. toctree::
-the dataset object.
+    :maxdepth: 1
+    :glob:
-.. code:: python
-    from dgl.graphbolt import OnDiskDataset
-    dataset = OnDiskDataset('/path/to/dataset')
-The returned ``dataset`` object just loads the yaml file and does not load any
-data. To load the graph structure, feature data and tasks, you need to call
-the ``load`` method.
-.. code:: python
-    dataset.load()
-The reason why we separate the ``OnDiskDataset`` object creation and data loading
-is that you may want to change some fields in the ``metadata.yaml`` file before
-loading the data. For example, you may want to change the path of the feature
-data files to point to a different directory. In this case, you can just
-modify the path via ``dataset.yaml_data`` directly. Then call the ``load`` method
-again to load the data.
-After loading the data, you can access the graph structure, feature data and
-tasks through the ``graph``, ``feature`` and ``tasks`` attributes respectively.
-.. code:: python
-    graph = dataset.graph
-    feature = dataset.feature
-    tasks = dataset.tasks
-The returned ``graph`` is a ``FusedCSCSamplingGraph`` object, which will be used
-for sampling. The returned ``feature`` is a ``TorchBasedFeatureStore`` object,
-which will be used for feature lookup. The returned ``tasks`` is a list of
-``Task`` objects, which will be used for training and evaluation.
-The following examples show data folder structure and ``metadata.yaml`` file for
-homogeneous graphs and heterogeneous graphs respectively. If you want to know
-the full YAML specification, please refer to the `Full YAML specification`_ section.
-Homogeneous graph
-----------------
-Data folder structure:
-^^^^^^^^^^^^^^^^^^^^^
-.. code::
-    data/
-      node_feat.npy
-      edge_feat.npy
-    edges/
-      edges.csv
-    set_nc/
-      train_seed_nodes.npy
-      train_labels.npy
-      val_seed_nodes.npy
-      val_labels.npy
-      test_seed_nodes.npy
-      test_labels.npy
-    set_lp/
-      train_node_pairs.npy
-      val_node_pairs.npy
-      val_negative_dsts.npy
-      test_node_pairs.npy
-      test_negative_dsts.npy
-    metadata.yaml
-``metadata.yaml`` file:
-^^^^^^^^^^^^^^^^^^^^^
-.. code:: yaml
-    dataset_name: homogeneous_graph_nc_lp
-    graph:
-      nodes:
-        - num: 10
-      edges:
-        - format: csv
-          path: edges/edges.csv
-    feature_data:
-      - domain: node
-        name: feat
-        format: numpy
-        in_memory: true
-        path: data/node_feat.npy
-      - domain: edge
-        name: feat
-        format: numpy
-        in_memory: true
-        path: data/edge_feat.npy
-    tasks:
-      - name: node_classification
-        num_classes: 2
-        train_set:
-          - data:
-              - name: seed_nodes
-                format: numpy
-                in_memory: true
-                path: set_nc/train_seed_nodes.npy
-              - name: labels
-                format: numpy
-                in_memory: true
-                path: set_nc/train_labels.npy
-        validation_set:
-          - data:
-              - name: seed_nodes
-                format: numpy
-                in_memory: true
-                path: set_nc/val_seed_nodes.npy
-              - name: labels
-                format: numpy
-                in_memory: true
-                path: set_nc/val_labels.npy
-        test_set:
-          - data:
-              - name: seed_nodes
-                format: numpy
-                in_memory: true
-                path: set_nc/test_seed_nodes.npy
-              - name: labels
-                format: numpy
-                in_memory: true
-                path: set_nc/test_labels.npy
-      - name: link_prediction
-        num_classes: 2
-        train_set:
-          - data:
-              - name: node_pairs
-                format: numpy
-                in_memory: true
-                path: set_lp/train_node_pairs.npy
-        validation_set:
-          - data:
-              - name: node_pairs
-                format: numpy
-                in_memory: true
-                path: set_lp/val_node_pairs.npy
-              - name: negative_dsts
-                format: numpy
-                in_memory: true
-                path: set_lp/val_negative_dsts.npy
-        test_set:
-          - data:
-              - name: node_pairs
-                format: numpy
-                in_memory: true
-                path: set_lp/test_node_pairs.npy
-              - name: negative_dsts
-                format: numpy
-                in_memory: true
-                path: set_lp/test_negative_dsts.npy
-For the graph structure, number of nodes is specified by the ``num`` field and
-edges are stored in a csv file in format of ``<src, dst>`` like below.
-.. code:: csv
-    edges.csv
-    0,1
-    1,2
-    2,3
-    3,4
-    4,5
-    5,6
-    6,7
-    7,8
-    8,9
-For the feature data, we have feature data named as ``feat`` for nodes and
-edges. The feature data are stored in numpy files in shape of ``[num_nodes, 10]``
-and ``[num_edges, 10]`` respectively like below.
-.. code:: python
-    node_feat.npy
-    array([[0., 0., 0., 0., 0., 0., 0., 0., 0., 0.],
-           [1., 1., 1., 1., 1., 1., 1., 1., 1., 1.],
-           [2., 2., 2., 2., 2., 2., 2., 2., 2., 2.],
-           [3., 3., 3., 3., 3., 3., 3., 3., 3., 3.],
-           [4., 4., 4., 4., 4., 4., 4., 4., 4., 4.],
-           [5., 5., 5., 5., 5., 5., 5., 5., 5., 5.],
-           [6., 6., 6., 6., 6., 6., 6., 6., 6., 6.],
-           [7., 7., 7., 7., 7., 7., 7., 7., 7., 7.],
-           [8., 8., 8., 8., 8., 8., 8., 8., 8., 8.],
-           [9., 9., 9., 9., 9., 9., 9., 9., 9., 9.]])
-    edge_feat.npy
-    array([[0., 0., 0., 0., 0., 0., 0., 0., 0., 0.],
-           [1., 1., 1., 1., 1., 1., 1., 1., 1., 1.],
-           [2., 2., 2., 2., 2., 2., 2., 2., 2., 2.],
-           [3., 3., 3., 3., 3., 3., 3., 3., 3., 3.],
-           [4., 4., 4., 4., 4., 4., 4., 4., 4., 4.],
-           [5., 5., 5., 5., 5., 5., 5., 5., 5., 5.],
-           [6., 6., 6., 6., 6., 6., 6., 6., 6., 6.],
-           [7., 7., 7., 7., 7., 7., 7., 7., 7., 7.],
-           [8., 8., 8., 8., 8., 8., 8., 8., 8., 8.],
-           [9., 9., 9., 9., 9., 9., 9., 9., 9., 9.]])
-For the ``tasks`` field, we have two tasks: ``node_classification`` and
-``link_prediction``. For each task, we have three sets: ``train_set``,
-``validation_set`` and ``test_set``.
-For ``node_classification`` task, we have two fields: ``seed_nodes`` and
-``labels``. The ``seed_nodes`` field is used to specify the node IDs for
-training and evaluation. The ``labels`` field is used to specify the
-labels. Both of them are stored in numpy files with shape of ``[num_nodes]``
-like below.
-.. code:: python
-    train_seed_nodes.npy
-    array([0, 1, 2, 3, 4, 5])
-    train_labels.npy
-    array([0, 1, 0, 1, 0, 1])
-    val_seed_nodes.npy
-    array([6, 7])
-    val_labels.npy
-    array([0, 1])
-    test_seed_nodes.npy
-    array([8, 9])
-    test_labels.npy
-    array([0, 1])
-For the ``link_prediction`` task, we have two fields: ``node_pairs`` and
-``negative_dsts``. The ``node_pairs`` field specifies the node pairs.
-The ``negative_dsts`` field specifies the negative destination nodes.
-They are stored in numpy files of shape ``[num_edges, 2]`` and
-``[num_edges, num_neg_dsts]`` respectively, as shown below.
-.. code:: python
-    train_node_pairs.npy
-    array([[0, 1],
-           [1, 2],
-           [2, 3],
-           [3, 4],
-           [4, 5],
-           [5, 6]])
-    val_node_pairs.npy
-    array([[6, 7],
-           [7, 8]])
-    val_negative_dsts.npy
-    array([[8, 9],
-           [8, 9]])
-    test_node_pairs.npy
-    array([[8, 9],
-           [9, 0]])
-    test_negative_dsts.npy
-    array([[0, 1],
-           [0, 1]])
-.. note::
-    The values of the ``name`` fields in ``tasks``, such as ``seed_nodes``,
-    ``labels``, ``node_pairs`` and ``negative_dsts``, are mandatory. They are
-    used to specify the data fields of ``MiniBatch`` for sampling. The values
-    of the ``name`` fields in ``feature_data``, such as ``feat``, are user-defined.
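The shape requirements for the link-prediction files above can be checked before the arrays are written to disk. The following is a minimal sketch, not part of GraphBolt: ``check_link_prediction_set`` is a hypothetical helper, and plain nested lists stand in for the numpy arrays.

```python
# Hypothetical helper (not part of GraphBolt): sanity-check link-prediction
# data given as nested Python lists.
def check_link_prediction_set(node_pairs, negative_dsts=None):
    """Return the number of edges after validating the expected shapes."""
    # node_pairs must have shape [num_edges, 2].
    if any(len(pair) != 2 for pair in node_pairs):
        raise ValueError("node_pairs must have shape [num_edges, 2]")
    num_edges = len(node_pairs)
    if negative_dsts is not None:
        # negative_dsts must have shape [num_edges, num_neg_dsts].
        if len(negative_dsts) != num_edges:
            raise ValueError("negative_dsts needs one row per node pair")
        if len({len(row) for row in negative_dsts}) > 1:
            raise ValueError("every row needs the same number of negatives")
    return num_edges


# Values copied from the validation set shown above.
val_node_pairs = [[6, 7], [7, 8]]
val_negative_dsts = [[8, 9], [8, 9]]
num_val_edges = check_link_prediction_set(val_node_pairs, val_negative_dsts)
```

The training set has no ``negative_dsts`` file in this example, so the second argument is optional.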
-Heterogeneous graph
--------------------
-Data folder structure:
-^^^^^^^^^^^^^^^^^^^^^^
-.. code::
-    data/
-      user_feat.npy
-      item_feat.npy
-      user_follow_user_feat.npy
-      user_click_item_feat.npy
-    edges/
-      user_follow_user.csv
-      user_click_item.csv
-    set_nc/
-      user_train_seed_nodes.npy
-      user_train_labels.npy
-      user_val_seed_nodes.npy
-      user_val_labels.npy
-      user_test_seed_nodes.npy
-      user_test_labels.npy
-    set_lp/
-      follow_train_node_pairs.npy
-      follow_val_node_pairs.npy
-      follow_val_negative_dsts.npy
-      follow_test_node_pairs.npy
-      follow_test_negative_dsts.npy
-    metadata.yaml
-``metadata.yaml`` file:
-^^^^^^^^^^^^^^^^^^^^^^^
-.. code:: yaml
-    dataset_name: heterogeneous_graph_nc_lp
-    graph:
-      nodes:
-        - type: user
-          num: 10
-        - type: item
-          num: 10
-      edges:
-        - type: "user:follow:user"
-          format: csv
-          path: edges/user_follow_user.csv
-        - type: "user:click:item"
-          format: csv
-          path: edges/user_click_item.csv
-    feature_data:
-      - domain: node
-        type: user
-        name: feat
-        format: numpy
-        in_memory: true
-        path: data/user_feat.npy
-      - domain: node
-        type: item
-        name: feat
-        format: numpy
-        in_memory: true
-        path: data/item_feat.npy
-      - domain: edge
-        type: "user:follow:user"
-        name: feat
-        format: numpy
-        in_memory: true
-        path: data/user_follow_user_feat.npy
-      - domain: edge
-        type: "user:click:item"
-        name: feat
-        format: numpy
-        in_memory: true
-        path: data/user_click_item_feat.npy
-    tasks:
-      - name: node_classification
-        num_classes: 2
-        train_set:
-          - type: user
-            data:
-              - name: seed_nodes
-                format: numpy
-                in_memory: true
-                path: set_nc/user_train_seed_nodes.npy
-              - name: labels
-                format: numpy
-                in_memory: true
-                path: set_nc/user_train_labels.npy
-        validation_set:
-          - type: user
-            data:
-              - name: seed_nodes
-                format: numpy
-                in_memory: true
-                path: set_nc/user_val_seed_nodes.npy
-              - name: labels
-                format: numpy
-                in_memory: true
-                path: set_nc/user_val_labels.npy
-        test_set:
-          - type: user
-            data:
-              - name: seed_nodes
-                format: numpy
-                in_memory: true
-                path: set_nc/user_test_seed_nodes.npy
-              - name: labels
-                format: numpy
-                in_memory: true
-                path: set_nc/user_test_labels.npy
-      - name: link_prediction
-        num_classes: 2
-        train_set:
-          - type: "user:follow:user"
-            data:
-              - name: node_pairs
-                format: numpy
-                in_memory: true
-                path: set_lp/follow_train_node_pairs.npy
-        validation_set:
-          - type: "user:follow:user"
-            data:
-              - name: node_pairs
-                format: numpy
-                in_memory: true
-                path: set_lp/follow_val_node_pairs.npy
-              - name: negative_dsts
-                format: numpy
-                in_memory: true
-                path: set_lp/follow_val_negative_dsts.npy
-        test_set:
-          - type: "user:follow:user"
-            data:
-              - name: node_pairs
-                format: numpy
-                in_memory: true
-                path: set_lp/follow_test_node_pairs.npy
-              - name: negative_dsts
-                format: numpy
-                in_memory: true
-                path: set_lp/follow_test_negative_dsts.npy
-For the graph structure, the above example has two types of nodes: ``user``
-and ``item``. The number of nodes of each type is specified by the ``num`` field.
-There are two types of edges: ``user:follow:user`` and ``user:click:item``.
-Each edge type is stored in its own two-column CSV file, as shown below.
-.. code:: csv
-    user_follow_user.csv
-    0,1
-    1,2
-    2,3
-    3,4
-    4,5
-    5,6
-    6,7
-    7,8
-    8,9
-    user_click_item.csv
-    0,0
-    1,1
-    2,2
-    3,3
-    4,4
-    5,5
-    6,6
-    7,7
-    8,8
-    9,9
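An edge type string such as ``user:click:item`` names the source node type, the relation and the destination node type, so the IDs in each CSV column must stay below the ``num`` declared for the corresponding node type. The following is a minimal sketch of that consistency check; ``check_edges`` is a hypothetical helper, not a GraphBolt API, and the edge lists are written out in Python rather than read from the CSV files.

```python
# Hypothetical helper (not part of GraphBolt): check heterogeneous edges
# against the node counts declared in ``metadata.yaml``.
def check_edges(edge_type, edges, num_nodes):
    """Validate ``edges`` (a list of (src, dst) pairs) for ``edge_type``."""
    # "user:click:item" -> source type, relation, destination type.
    src_type, relation, dst_type = edge_type.split(":")
    for node_type in (src_type, dst_type):
        if node_type not in num_nodes:
            raise ValueError(f"undeclared node type: {node_type}")
    for src, dst in edges:
        if not 0 <= src < num_nodes[src_type]:
            raise ValueError(f"source ID {src} out of range for {src_type}")
        if not 0 <= dst < num_nodes[dst_type]:
            raise ValueError(f"destination ID {dst} out of range for {dst_type}")
    return src_type, relation, dst_type


num_nodes = {"user": 10, "item": 10}           # from the ``nodes`` field
follow_edges = [(i, i + 1) for i in range(9)]  # rows of user_follow_user.csv
click_edges = [(i, i) for i in range(10)]      # rows of user_click_item.csv

follow_parts = check_edges("user:follow:user", follow_edges, num_nodes)
click_parts = check_edges("user:click:item", click_edges, num_nodes)
```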
-For the feature data, every node type and edge type has a feature named
-``feat``. The features are stored in numpy files of shape ``[num_nodes, 10]``
-and ``[num_edges, 10]`` respectively, as shown below.
-.. code:: python
-    user_feat.npy
-    array([[0., 0., 0., 0., 0., 0., 0., 0., 0., 0.],
-           [1., 1., 1., 1., 1., 1., 1., 1., 1., 1.],
-           [2., 2., 2., 2., 2., 2., 2., 2., 2., 2.],
-           [3., 3., 3., 3., 3., 3., 3., 3., 3., 3.],
-           [4., 4., 4., 4., 4., 4., 4., 4., 4., 4.],
-           [5., 5., 5., 5., 5., 5., 5., 5., 5., 5.],
-           [6., 6., 6., 6., 6., 6., 6., 6., 6., 6.],
-           [7., 7., 7., 7., 7., 7., 7., 7., 7., 7.],
-           [8., 8., 8., 8., 8., 8., 8., 8., 8., 8.],
-           [9., 9., 9., 9., 9., 9., 9., 9., 9., 9.]])
-    item_feat.npy
-    array([[0., 0., 0., 0., 0., 0., 0., 0., 0., 0.],
-           [1., 1., 1., 1., 1., 1., 1., 1., 1., 1.],
-           [2., 2., 2., 2., 2., 2., 2., 2., 2., 2.],
-           [3., 3., 3., 3., 3., 3., 3., 3., 3., 3.],
-           [4., 4., 4., 4., 4., 4., 4., 4., 4., 4.],
-           [5., 5., 5., 5., 5., 5., 5., 5., 5., 5.],
-           [6., 6., 6., 6., 6., 6., 6., 6., 6., 6.],
-           [7., 7., 7., 7., 7., 7., 7., 7., 7., 7.],
-           [8., 8., 8., 8., 8., 8., 8., 8., 8., 8.],
-           [9., 9., 9., 9., 9., 9., 9., 9., 9., 9.]])
-    user_follow_user_feat.npy
-    array([[0., 0., 0., 0., 0., 0., 0., 0., 0., 0.],
-           [1., 1., 1., 1., 1., 1., 1., 1., 1., 1.],
-           [2., 2., 2., 2., 2., 2., 2., 2., 2., 2.],
-           [3., 3., 3., 3., 3., 3., 3., 3., 3., 3.],
-           [4., 4., 4., 4., 4., 4., 4., 4., 4., 4.],
-           [5., 5., 5., 5., 5., 5., 5., 5., 5., 5.],
-           [6., 6., 6., 6., 6., 6., 6., 6., 6., 6.],
-           [7., 7., 7., 7., 7., 7., 7., 7., 7., 7.],
-           [8., 8., 8., 8., 8., 8., 8., 8., 8., 8.],
-           [9., 9., 9., 9., 9., 9., 9., 9., 9., 9.]])
-    user_click_item_feat.npy
-    array([[0., 0., 0., 0., 0., 0., 0., 0., 0., 0.],
-           [1., 1., 1., 1., 1., 1., 1., 1., 1., 1.],
-           [2., 2., 2., 2., 2., 2., 2., 2., 2., 2.],
-           [3., 3., 3., 3., 3., 3., 3., 3., 3., 3.],
-           [4., 4., 4., 4., 4., 4., 4., 4., 4., 4.],
-           [5., 5., 5., 5., 5., 5., 5., 5., 5., 5.],
-           [6., 6., 6., 6., 6., 6., 6., 6., 6., 6.],
-           [7., 7., 7., 7., 7., 7., 7., 7., 7., 7.],
-           [8., 8., 8., 8., 8., 8., 8., 8., 8., 8.],
-           [9., 9., 9., 9., 9., 9., 9., 9., 9., 9.]])
-For the ``tasks`` field, we have two tasks: ``node_classification`` and
-``link_prediction``. For each task, we have three sets: ``train_set``,
-``validation_set`` and ``test_set``.
-For ``node_classification`` task, we have two fields: ``seed_nodes`` and
-``labels``. The ``seed_nodes`` field is used to specify the node IDs for
-training and evaluation. The ``labels`` field is used to specify the
-labels. Both of them are stored in numpy files with shape of ``[num_nodes]``
-like below.
-.. code:: python
-    user_train_seed_nodes.npy
-    array([0, 1, 2, 3, 4, 5])
-    user_train_labels.npy
-    array([0, 1, 0, 1, 0, 1])
-    user_val_seed_nodes.npy
-    array([6, 7])
-    user_val_labels.npy
-    array([0, 1])
-    user_test_seed_nodes.npy
-    array([8, 9])
-    user_test_labels.npy
-    array([0, 1])
-For the ``link_prediction`` task, we have two fields: ``node_pairs`` and
-``negative_dsts``. The ``node_pairs`` field specifies the node pairs.
-The ``negative_dsts`` field specifies the negative destination nodes.
-They are stored in numpy files of shape ``[num_edges, 2]`` and
-``[num_edges, num_neg_dsts]`` respectively, as shown below.
-.. code:: python
-    follow_train_node_pairs.npy
-    array([[0, 1],
-           [1, 2],
-           [2, 3],
-           [3, 4],
-           [4, 5],
-           [5, 6]])
-    follow_val_node_pairs.npy
-    array([[6, 7],
-           [7, 8]])
-    follow_val_negative_dsts.npy
-    array([[8, 9],
-           [8, 9]])
-    follow_test_node_pairs.npy
-    array([[8, 9],
-           [9, 0]])
-    follow_test_negative_dsts.npy
-    array([[0, 1],
-           [0, 1]])
-Full YAML specification
------------------------
-The full YAML specification of ``metadata.yaml`` file is shown below.
-.. code:: yaml
-    dataset_name: <string>
-    graph:
-      nodes:
-        - type: <string>
-          num: <int>
-        - type: <string>
-          num: <int>
-      edges:
-        - type: <string>
-          format: <string>
-          path: <string>
-        - type: <string>
-          format: <string>
-          path: <string>
-    feature_data:
-      - domain: node
-        type: <string>
-        name: <string>
-        format: <string>
-        in_memory: <bool>
-        path: <string>
-      - domain: node
-        type: <string>
-        name: <string>
-        format: <string>
-        in_memory: <bool>
-        path: <string>
-      - domain: edge
-        type: <string>
-        name: <string>
-        format: <string>
-        in_memory: <bool>
-        path: <string>
-      - domain: edge
-        type: <string>
-        name: <string>
-        format: <string>
-        in_memory: <bool>
-        path: <string>
-    tasks:
-      - name: <string>
-        num_classes: <int>
-        train_set:
-          - type: <string>
-            data:
-              - name: <string>
-                format: <string>
-                in_memory: <bool>
-                path: <string>
-              - name: <string>
-                format: <string>
-                in_memory: <bool>
-                path: <string>
-        validation_set:
-          - type: <string>
-            data:
-              - name: <string>
-                format: <string>
-                in_memory: <bool>
-                path: <string>
-              - name: <string>
-                format: <string>
-                in_memory: <bool>
-                path: <string>
-        test_set:
-          - type: <string>
-            data:
-              - name: <string>
-                format: <string>
-                in_memory: <bool>
-                path: <string>
-              - name: <string>
-                format: <string>
-                in_memory: <bool>
-                path: <string>
-``dataset_name``
-^^^^^^^^^^^^^^^^
-The ``dataset_name`` field is used to specify the name of the dataset. It is
-user-defined.
-``graph``
-^^^^^^^^^
-The ``graph`` field is used to specify the graph structure. It has two fields:
-``nodes`` and ``edges``.
- - ``nodes``: ``list``
-   The ``nodes`` field is used to specify the number of nodes for each node type.
-   It is a list of ``node`` objects. Each ``node`` object has two fields: ``type``
-   and ``num``.
-    - ``type``: ``string``, optional
-      The ``type`` field is used to specify the node type. It is ``null`` for
-      homogeneous graphs. For heterogeneous graphs, it is the node type.
-    - ``num``: ``int``
-      The ``num`` field is used to specify the number of nodes for the node type.
-      It is mandatory for both homogeneous graphs and heterogeneous graphs.
- - ``edges``: ``list``
-   The ``edges`` field is used to specify the edges. It is a list of ``edge``
-   objects. Each ``edge`` object has three fields: ``type``, ``format`` and
-   ``path``.
-    - ``type``: ``string``, optional
-      The ``type`` field is used to specify the edge type. It is ``null`` for
-      homogeneous graphs. For heterogeneous graphs, it is the edge type.
-    - ``format``: ``string``
-      The ``format`` field is used to specify the format of the edge data. It can
-      only be ``csv`` for now.
-    - ``path``: ``string``
-      The ``path`` field is used to specify the path of the edge data. It is
-      relative to the directory of ``metadata.yaml`` file.
-``feature_data``
-^^^^^^^^^^^^^^^^
-The ``feature_data`` field is used to specify the feature data. It is a list of
-``feature`` objects. Each ``feature`` object has six canonical fields: ``domain``,
-``type``, ``name``, ``format``, ``in_memory`` and ``path``. Any other fields will
-be passed to the ``Feature.metadata`` object.
- - ``domain``: ``string``
-   The ``domain`` field is used to specify the domain of the feature data. It can
-   be either ``node`` or ``edge``.
- - ``type``: ``string``, optional
-   The ``type`` field is used to specify the type of the feature data. It is
-   ``null`` for homogeneous graphs. For heterogeneous graphs, it is the node or
-   edge type.
- - ``name``: ``string``
-   The ``name`` field is used to specify the name of the feature data. It is
-   user-defined.
- - ``format``: ``string``
-   The ``format`` field is used to specify the format of the feature data. It can
-   be either ``numpy`` or ``torch``.
- - ``in_memory``: ``bool``, optional
-   The ``in_memory`` field is used to specify whether the feature data is loaded
-   into memory. It can be either ``true`` or ``false``. Default is ``true``.
- - ``path``: ``string``
-   The ``path`` field is used to specify the path of the feature data. It is
-   relative to the directory of ``metadata.yaml`` file.
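Because every ``path`` field is resolved relative to the directory that holds ``metadata.yaml``, a mismatch between the metadata and the folder layout is easy to introduce. The following is a minimal sketch of a pre-flight check, assuming the metadata has already been parsed into nested Python dicts and lists (for example with a YAML parser); ``iter_paths`` and ``missing_files`` are hypothetical helpers, not part of GraphBolt.

```python
from pathlib import Path


def iter_paths(node):
    """Yield every ``path`` value found anywhere in a nested metadata dict."""
    if isinstance(node, dict):
        for key, value in node.items():
            if key == "path" and isinstance(value, str):
                yield value
            else:
                yield from iter_paths(value)
    elif isinstance(node, list):
        for item in node:
            yield from iter_paths(item)


def missing_files(metadata, dataset_dir):
    """Return the referenced paths that do not exist under ``dataset_dir``."""
    root = Path(dataset_dir)
    return sorted(
        path for path in set(iter_paths(metadata)) if not (root / path).is_file()
    )
```

Calling ``missing_files(metadata, '/path/to/dataset')`` before constructing the dataset returns an empty list when every referenced file is in place.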
-``tasks``
-^^^^^^^^^
-The ``tasks`` field is used to specify the tasks. It is a list of ``task``
-objects. Each ``task`` object has at least three fields: ``train_set``,
-``validation_set`` and ``test_set``. You are free to add other fields such as
-``num_classes``; any such additional fields will be passed to the
-``Task.metadata`` object.
- - ``name``: ``string``, optional
-   The ``name`` field is used to specify the name of the task. It is user-defined.
- - ``num_classes``: ``int``, optional
-   The ``num_classes`` field is used to specify the number of classes of the task.
- - ``train_set``: ``list``
-   The ``train_set`` field is used to specify the training set. It is a list of
-   ``set`` objects. Each ``set`` object has two fields: ``type`` and ``data``.
-    - ``type``: ``string``, optional
-      The ``type`` field is used to specify the node/edge type of the set. It is
-      ``null`` for homogeneous graphs. For heterogeneous graphs, it is the node
-      or edge type.
-    - ``data``: ``list``
-      The ``data`` field is used to load ``train_set``. It is a list of ``data``
-      objects. Each ``data`` object has four fields: ``name``, ``format``,
-      ``in_memory`` and ``path``.
-       - ``name``: ``string``
-         The ``name`` field is used to specify the name of the data. It is mandatory
-         and used to specify the data fields of ``MiniBatch`` for sampling. It can
-         be either ``seed_nodes``, ``labels``, ``node_pairs``, ``negative_srcs`` or
-         ``negative_dsts``. If any other name is used, it will be added into the
-         ``MiniBatch`` data fields.
-       - ``format``: ``string``
-         The ``format`` field is used to specify the format of the data. It can be
-         either ``numpy`` or ``torch``.
-       - ``in_memory``: ``bool``, optional
-         The ``in_memory`` field is used to specify whether the data is loaded into
-         memory. It can be either ``true`` or ``false``. Default is ``true``.
-       - ``path``: ``string``
-         The ``path`` field is used to specify the path of the data. It is relative
-         to the directory of ``metadata.yaml`` file.
- - ``validation_set``: ``list``
- - ``test_set``: ``list``
-   The ``validation_set`` and ``test_set`` fields are used to specify the
-   validation set and test set respectively. They are similar to the
-   ``train_set`` field.
+    ondisk_dataset_homograph.nblink
+    ondisk_dataset_heterograph.nblink
+    ondisk-dataset-specification.rst
--- a/docs/source/stochastic_training/ondisk_dataset_heterograph.nblink
+++ b/docs/source/stochastic_training/ondisk_dataset_heterograph.nblink
+{
+    "path": "../../../notebooks/stochastic_training/ondisk_dataset_heterograph.ipynb"
+}
--- a/docs/source/stochastic_training/ondisk_dataset_homograph.nblink
+++ b/docs/source/stochastic_training/ondisk_dataset_homograph.nblink
+{
+    "path": "../../../notebooks/stochastic_training/ondisk_dataset_homograph.ipynb"
+}
--- a/notebooks/stochastic_training/ondisk_dataset_heterograph.ipynb
+++ b/notebooks/stochastic_training/ondisk_dataset_heterograph.ipynb
+{
+  "nbformat": 4,
+  "nbformat_minor": 0,
+  "metadata": {
+    "colab": {
+      "private_outputs": true,
+      "provenance": [],
+      "authorship_tag": "ABX9TyM1zJGR6lVdC9JfDbddFLpa"
+    },
+    "kernelspec": {
+      "name": "python3",
+      "display_name": "Python 3"
+    },
+    "language_info": {
+      "name": "python"
+    }
+  },
+  "cells": [
+    {
+      "cell_type": "markdown",
+      "source": [
+        "# OnDiskDataset for Heterogeneous Graph\n",
+        "\n",
+        "[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/dmlc/dgl/blob/master/notebooks/stochastic_training/ondisk_dataset_heterograph.ipynb) [![GitHub](https://img.shields.io/badge/-View%20on%20GitHub-181717?logo=github&logoColor=ffffff)](https://github.com/dmlc/dgl/blob/master/notebooks/stochastic_training/ondisk_dataset_heterograph.ipynb)\n",
+        "\n",
+        "This tutorial shows how to create an `OnDiskDataset` for a heterogeneous graph that can be used in the **GraphBolt** framework.\n",
+        "\n",
+        "By the end of this tutorial, you will be able to\n",
+        "- organize graph structure data.\n",
+        "- organize feature data.\n",
+        "- organize training/validation/test set for specific tasks."
+      ],
+      "metadata": {
+        "id": "FnFhPMaAfLtJ"
+      }
+    },
+    {
+      "cell_type": "markdown",
+      "source": [
+        "## Install DGL package"
+      ],
+      "metadata": {
+        "id": "Wlb19DtWgtzq"
+      }
+    },
+    {
+      "cell_type": "code",
+      "source": [
+        "# Install required packages.\n",
+        "import os\n",
+        "import torch\n",
+        "import numpy as np\n",
+        "os.environ['TORCH'] = torch.__version__\n",
+        "os.environ['DGLBACKEND'] = \"pytorch\"\n",
+        "\n",
+        "# Install the CPU version.\n",
+        "device = torch.device(\"cpu\")\n",
+        "!pip install --pre dgl -f https://data.dgl.ai/wheels-test/repo.html\n",
+        "\n",
+        "try:\n",
+        "    import dgl\n",
+        "    import dgl.graphbolt as gb\n",
+        "    installed = True\n",
+        "except ImportError as error:\n",
+        "    installed = False\n",
+        "    print(error)\n",
+        "print(\"DGL installed!\" if installed else \"DGL not found!\")"
+      ],
+      "metadata": {
+        "id": "UojlT9ZGgyr9"
+      },
+      "execution_count": null,
+      "outputs": []
+    }
+  ]
+}
--- a/notebooks/stochastic_training/ondisk_dataset_homograph.ipynb
+++ b/notebooks/stochastic_training/ondisk_dataset_homograph.ipynb
+{
+  "nbformat": 4,
+  "nbformat_minor": 0,
+  "metadata": {
+    "colab": {
+      "private_outputs": true,
+      "provenance": [],
+      "authorship_tag": "ABX9TyMnOgpk68ZvpOQVFBgDxDof"
+    },
+    "kernelspec": {
+      "name": "python3",
+      "display_name": "Python 3"
+    },
+    "language_info": {
+      "name": "python"
+    }
+  },
+  "cells": [
+    {
+      "cell_type": "markdown",
+      "source": [
+        "# OnDiskDataset for Homogeneous Graph\n",
+        "\n",
+        "[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/dmlc/dgl/blob/master/notebooks/stochastic_training/ondisk_dataset_homograph.ipynb) [![GitHub](https://img.shields.io/badge/-View%20on%20GitHub-181717?logo=github&logoColor=ffffff)](https://github.com/dmlc/dgl/blob/master/notebooks/stochastic_training/ondisk_dataset_homograph.ipynb)\n",
+        "\n",
+        "This tutorial shows how to create an `OnDiskDataset` for a homogeneous graph that can be used in the **GraphBolt** framework.\n",
+        "\n",
+        "By the end of this tutorial, you will be able to\n",
+        "- organize graph structure data.\n",
+        "- organize feature data.\n",
+        "- organize training/validation/test set for specific tasks."
+      ],
+      "metadata": {
+        "id": "FnFhPMaAfLtJ"
+      }
+    },
+    {
+      "cell_type": "markdown",
+      "source": [
+        "## Install DGL package"
+      ],
+      "metadata": {
+        "id": "Wlb19DtWgtzq"
+      }
+    },
+    {
+      "cell_type": "code",
+      "source": [
+        "# Install required packages.\n",
+        "import os\n",
+        "import torch\n",
+        "import numpy as np\n",
+        "os.environ['TORCH'] = torch.__version__\n",
+        "os.environ['DGLBACKEND'] = \"pytorch\"\n",
+        "\n",
+        "# Install the CPU version.\n",
+        "device = torch.device(\"cpu\")\n",
+        "!pip install --pre dgl -f https://data.dgl.ai/wheels-test/repo.html\n",
+        "\n",
+        "try:\n",
+        "    import dgl\n",
+        "    import dgl.graphbolt as gb\n",
+        "    installed = True\n",
+        "except ImportError as error:\n",
+        "    installed = False\n",
+        "    print(error)\n",
+        "print(\"DGL installed!\" if installed else \"DGL not found!\")"
+      ],
+      "metadata": {
+        "id": "UojlT9ZGgyr9"
+      },
+      "execution_count": null,
+      "outputs": []
+    }
+  ]
+}