"git@developer.sourcefind.cn:change/sglang.git" did not exist on "3f41b48c4079d0adaf17ce7adf308c6a0d947ad0"
Unverified Commit 7094ff4f authored by Rhett Ying's avatar Rhett Ying Committed by GitHub
Browse files

[doc] add spec for OnDiskDataset (#6810)

parent ca675ed4
.. _stochastic_training-ondisk-dataset-specification:

Prepare dataset
===============

YAML specification
==================

**GraphBolt** provides the ``OnDiskDataset`` class to help users organize the
plain data of a graph: its structure, feature data and tasks. ``OnDiskDataset``
is also designed to efficiently handle large graphs and features that do not
fit into memory by storing them on disk.

To create an ``OnDiskDataset`` object, organize all the data, including the
graph structure, feature data and tasks, into a directory. The directory must
contain a ``metadata.yaml`` file that describes the metadata of the dataset.
Then pass the directory path to the ``OnDiskDataset`` constructor to create
the dataset object.

.. code:: python

    from dgl.graphbolt import OnDiskDataset

    dataset = OnDiskDataset('/path/to/dataset')

The returned ``dataset`` object only loads the YAML file; it does not load any
data yet. To load the graph structure, feature data and tasks, call the
``load`` method.

.. code:: python

    dataset.load()

We separate ``OnDiskDataset`` creation from data loading because you may want
to change some fields in the ``metadata.yaml`` file before loading the data.
For example, you may want to point the feature data files to a different
directory. In this case, modify the paths via ``dataset.yaml_data`` directly,
then call the ``load`` method to load the data.

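As a minimal sketch of such a patch, using a plain dictionary as a hypothetical stand-in for ``dataset.yaml_data`` (whose layout mirrors ``metadata.yaml``), the ``/mnt/shared`` target directory is made up for illustration:

```python
# Hypothetical stand-in for dataset.yaml_data: the parsed metadata.yaml.
yaml_data = {
    "feature_data": [
        {"domain": "node", "name": "feat", "path": "data/node_feat.npy"},
        {"domain": "edge", "name": "feat", "path": "data/edge_feat.npy"},
    ]
}
# Point the feature files to a different directory before calling load().
for feat in yaml_data["feature_data"]:
    feat["path"] = feat["path"].replace("data/", "/mnt/shared/data/", 1)
```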
After loading the data, you can access the graph structure, feature data and
tasks through the ``graph``, ``feature`` and ``tasks`` attributes respectively.

.. code:: python

    graph = dataset.graph
    feature = dataset.feature
    tasks = dataset.tasks

The returned ``graph`` is a ``FusedCSCSamplingGraph`` object, which is used
for sampling. The returned ``feature`` is a ``TorchBasedFeatureStore`` object,
which is used for feature lookup. The returned ``tasks`` is a list of
``Task`` objects, which are used for training and evaluation.

The following examples show the data folder structure and ``metadata.yaml``
file for homogeneous graphs and heterogeneous graphs respectively. For the
complete specification, refer to the `Full YAML specification`_ section.

Homogeneous graph
-----------------

Data folder structure:
^^^^^^^^^^^^^^^^^^^^^^

.. code::

    data/
        node_feat.npy
        edge_feat.npy
    edges/
        edges.csv
    set_nc/
        train_seed_nodes.npy
        train_labels.npy
        val_seed_nodes.npy
        val_labels.npy
        test_seed_nodes.npy
        test_labels.npy
    set_lp/
        train_node_pairs.npy
        val_node_pairs.npy
        val_negative_dsts.npy
        test_node_pairs.npy
        test_negative_dsts.npy
    metadata.yaml

``metadata.yaml`` file:
^^^^^^^^^^^^^^^^^^^^^^^

.. code:: yaml

    dataset_name: homogeneous_graph_nc_lp
    graph:
      nodes:
        - num: 10
      edges:
        - format: csv
          path: edges/edges.csv
    feature_data:
      - domain: node
        name: feat
        format: numpy
        in_memory: true
        path: data/node_feat.npy
      - domain: edge
        name: feat
        format: numpy
        in_memory: true
        path: data/edge_feat.npy
    tasks:
      - name: node_classification
        num_classes: 2
        train_set:
          - data:
              - name: seed_nodes
                format: numpy
                in_memory: true
                path: set_nc/train_seed_nodes.npy
              - name: labels
                format: numpy
                in_memory: true
                path: set_nc/train_labels.npy
        validation_set:
          - data:
              - name: seed_nodes
                format: numpy
                in_memory: true
                path: set_nc/val_seed_nodes.npy
              - name: labels
                format: numpy
                in_memory: true
                path: set_nc/val_labels.npy
        test_set:
          - data:
              - name: seed_nodes
                format: numpy
                in_memory: true
                path: set_nc/test_seed_nodes.npy
              - name: labels
                format: numpy
                in_memory: true
                path: set_nc/test_labels.npy
      - name: link_prediction
        num_classes: 2
        train_set:
          - data:
              - name: node_pairs
                format: numpy
                in_memory: true
                path: set_lp/train_node_pairs.npy
        validation_set:
          - data:
              - name: node_pairs
                format: numpy
                in_memory: true
                path: set_lp/val_node_pairs.npy
              - name: negative_dsts
                format: numpy
                in_memory: true
                path: set_lp/val_negative_dsts.npy
        test_set:
          - data:
              - name: node_pairs
                format: numpy
                in_memory: true
                path: set_lp/test_node_pairs.npy
              - name: negative_dsts
                format: numpy
                in_memory: true
                path: set_lp/test_negative_dsts.npy

For the graph structure, the number of nodes is specified by the ``num`` field,
and the edges are stored in a CSV file with ``<src, dst>`` rows, as shown below.

.. code::

    edges.csv
    0,1
    1,2
    2,3
    3,4
    4,5
    5,6
    6,7
    7,8
    8,9

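The edge list above can be generated with NumPy alone; this is a sketch of one way to produce the file, not part of the GraphBolt API:

```python
import numpy as np

# Build the 9 edges (0->1, 1->2, ..., 8->9) as a [num_edges, 2] array.
edges = np.stack([np.arange(9), np.arange(1, 10)], axis=1)
# Write them as "<src>,<dst>" rows, matching the expected CSV format.
np.savetxt("edges.csv", edges, fmt="%d", delimiter=",")
```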
For the feature data, both nodes and edges have a feature named ``feat``. The
features are stored in NumPy files with shapes ``[num_nodes, 10]`` and
``[num_edges, 10]`` respectively, as shown below.

.. code:: python

    node_feat.npy
    array([[0., 0., 0., 0., 0., 0., 0., 0., 0., 0.],
           [1., 1., 1., 1., 1., 1., 1., 1., 1., 1.],
           [2., 2., 2., 2., 2., 2., 2., 2., 2., 2.],
           [3., 3., 3., 3., 3., 3., 3., 3., 3., 3.],
           [4., 4., 4., 4., 4., 4., 4., 4., 4., 4.],
           [5., 5., 5., 5., 5., 5., 5., 5., 5., 5.],
           [6., 6., 6., 6., 6., 6., 6., 6., 6., 6.],
           [7., 7., 7., 7., 7., 7., 7., 7., 7., 7.],
           [8., 8., 8., 8., 8., 8., 8., 8., 8., 8.],
           [9., 9., 9., 9., 9., 9., 9., 9., 9., 9.]])
    edge_feat.npy
    array([[0., 0., 0., 0., 0., 0., 0., 0., 0., 0.],
           [1., 1., 1., 1., 1., 1., 1., 1., 1., 1.],
           [2., 2., 2., 2., 2., 2., 2., 2., 2., 2.],
           [3., 3., 3., 3., 3., 3., 3., 3., 3., 3.],
           [4., 4., 4., 4., 4., 4., 4., 4., 4., 4.],
           [5., 5., 5., 5., 5., 5., 5., 5., 5., 5.],
           [6., 6., 6., 6., 6., 6., 6., 6., 6., 6.],
           [7., 7., 7., 7., 7., 7., 7., 7., 7., 7.],
           [8., 8., 8., 8., 8., 8., 8., 8., 8., 8.],
           [9., 9., 9., 9., 9., 9., 9., 9., 9., 9.]])

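Feature files like these can be produced with ``np.save``; the snippet below is a sketch that recreates the node features shown above (the ``data/`` directory is created first):

```python
import os

import numpy as np

os.makedirs("data", exist_ok=True)
# Row i holds the value i repeated across a 10-dimensional feature vector.
node_feat = np.repeat(np.arange(10, dtype=np.float64)[:, None], 10, axis=1)
np.save("data/node_feat.npy", node_feat)
```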
For the ``tasks`` field, we have two tasks: ``node_classification`` and
``link_prediction``. For each task, we have three sets: ``train_set``,
``validation_set`` and ``test_set``.
For the ``node_classification`` task, there are two fields: ``seed_nodes`` and
``labels``. The ``seed_nodes`` field specifies the node IDs used for training
and evaluation, and the ``labels`` field specifies their labels. Both are
stored in NumPy files of shape ``[num_seeds]`` per split, as shown below.

.. code:: python

    train_seed_nodes.npy
    array([0, 1, 2, 3, 4, 5])
    train_labels.npy
    array([0, 1, 0, 1, 0, 1])
    val_seed_nodes.npy
    array([6, 7])
    val_labels.npy
    array([0, 1])
    test_seed_nodes.npy
    array([8, 9])
    test_labels.npy
    array([0, 1])

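The split files above can be written with a short loop; this is an illustrative sketch, with the split names and values taken from the arrays shown above:

```python
import os

import numpy as np

os.makedirs("set_nc", exist_ok=True)
# (seed node IDs, labels) per split, matching the arrays shown above.
splits = {
    "train": ([0, 1, 2, 3, 4, 5], [0, 1, 0, 1, 0, 1]),
    "val": ([6, 7], [0, 1]),
    "test": ([8, 9], [0, 1]),
}
for split, (seeds, labels) in splits.items():
    np.save(f"set_nc/{split}_seed_nodes.npy", np.array(seeds))
    np.save(f"set_nc/{split}_labels.npy", np.array(labels))
```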
For the ``link_prediction`` task, there are two fields: ``node_pairs`` and
``negative_dsts``. The ``node_pairs`` field specifies the positive
source-destination pairs, and the ``negative_dsts`` field specifies the
negative destination nodes. They are stored in NumPy files with shapes
``[num_pairs, 2]`` and ``[num_pairs, num_neg_dsts]`` respectively, as shown
below.

.. code:: python

    train_node_pairs.npy
    array([[0, 1],
           [1, 2],
           [2, 3],
           [3, 4],
           [4, 5],
           [5, 6]])
    val_node_pairs.npy
    array([[6, 7],
           [7, 8]])
    val_negative_dsts.npy
    array([[8, 9],
           [8, 9]])
    test_node_pairs.npy
    array([[8, 9],
           [9, 0]])
    test_negative_dsts.npy
    array([[0, 1],
           [0, 1]])

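As with the classification sets, these files are plain NumPy arrays on disk; a sketch that recreates them, with values copied from the arrays shown above:

```python
import os

import numpy as np

os.makedirs("set_lp", exist_ok=True)
# [num_pairs, 2] positive pairs and [num_pairs, num_neg_dsts] negatives.
np.save("set_lp/train_node_pairs.npy",
        np.array([[0, 1], [1, 2], [2, 3], [3, 4], [4, 5], [5, 6]]))
np.save("set_lp/val_node_pairs.npy", np.array([[6, 7], [7, 8]]))
np.save("set_lp/val_negative_dsts.npy", np.array([[8, 9], [8, 9]]))
np.save("set_lp/test_node_pairs.npy", np.array([[8, 9], [9, 0]]))
np.save("set_lp/test_negative_dsts.npy", np.array([[0, 1], [0, 1]]))
```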

.. note::

    The values of the ``name`` fields under ``tasks``, such as ``seed_nodes``,
    ``labels``, ``node_pairs`` and ``negative_dsts``, are mandatory. They are
    used to specify the data fields of ``MiniBatch`` for sampling. The values
    of the ``name`` fields under ``feature_data``, such as ``feat``, are
    user-defined.

Heterogeneous graph
-------------------

Data folder structure:
^^^^^^^^^^^^^^^^^^^^^^

.. code::

    data/
        user_feat.npy
        item_feat.npy
        user_follow_user_feat.npy
        user_click_item_feat.npy
    edges/
        user_follow_user.csv
        user_click_item.csv
    set_nc/
        user_train_seed_nodes.npy
        user_train_labels.npy
        user_val_seed_nodes.npy
        user_val_labels.npy
        user_test_seed_nodes.npy
        user_test_labels.npy
    set_lp/
        follow_train_node_pairs.npy
        follow_val_node_pairs.npy
        follow_val_negative_dsts.npy
        follow_test_node_pairs.npy
        follow_test_negative_dsts.npy
    metadata.yaml

``metadata.yaml`` file:
^^^^^^^^^^^^^^^^^^^^^^^

.. code:: yaml

    dataset_name: heterogeneous_graph_nc_lp
    graph:
      nodes:
        - type: user
          num: 10
        - type: item
          num: 10
      edges:
        - type: "user:follow:user"
          format: csv
          path: edges/user_follow_user.csv
        - type: "user:click:item"
          format: csv
          path: edges/user_click_item.csv
    feature_data:
      - domain: node
        type: user
        name: feat
        format: numpy
        in_memory: true
        path: data/user_feat.npy
      - domain: node
        type: item
        name: feat
        format: numpy
        in_memory: true
        path: data/item_feat.npy
      - domain: edge
        type: "user:follow:user"
        name: feat
        format: numpy
        in_memory: true
        path: data/user_follow_user_feat.npy
      - domain: edge
        type: "user:click:item"
        name: feat
        format: numpy
        in_memory: true
        path: data/user_click_item_feat.npy
    tasks:
      - name: node_classification
        num_classes: 2
        train_set:
          - type: user
            data:
              - name: seed_nodes
                format: numpy
                in_memory: true
                path: set_nc/user_train_seed_nodes.npy
              - name: labels
                format: numpy
                in_memory: true
                path: set_nc/user_train_labels.npy
        validation_set:
          - type: user
            data:
              - name: seed_nodes
                format: numpy
                in_memory: true
                path: set_nc/user_val_seed_nodes.npy
              - name: labels
                format: numpy
                in_memory: true
                path: set_nc/user_val_labels.npy
        test_set:
          - type: user
            data:
              - name: seed_nodes
                format: numpy
                in_memory: true
                path: set_nc/user_test_seed_nodes.npy
              - name: labels
                format: numpy
                in_memory: true
                path: set_nc/user_test_labels.npy
      - name: link_prediction
        num_classes: 2
        train_set:
          - type: "user:follow:user"
            data:
              - name: node_pairs
                format: numpy
                in_memory: true
                path: set_lp/follow_train_node_pairs.npy
        validation_set:
          - type: "user:follow:user"
            data:
              - name: node_pairs
                format: numpy
                in_memory: true
                path: set_lp/follow_val_node_pairs.npy
              - name: negative_dsts
                format: numpy
                in_memory: true
                path: set_lp/follow_val_negative_dsts.npy
        test_set:
          - type: "user:follow:user"
            data:
              - name: node_pairs
                format: numpy
                in_memory: true
                path: set_lp/follow_test_node_pairs.npy
              - name: negative_dsts
                format: numpy
                in_memory: true
                path: set_lp/follow_test_negative_dsts.npy

For the graph structure, the above example has two node types, ``user`` and
``item``; the number of nodes of each type is specified by the ``num`` field.
There are two edge types, ``user:follow:user`` and ``user:click:item``, stored
as two-column CSV files as shown below.

.. code::

    user_follow_user.csv
    0,1
    1,2
    2,3
    3,4
    4,5
    5,6
    6,7
    7,8
    8,9
    user_click_item.csv
    0,0
    1,1
    2,2
    3,3
    4,4
    5,5
    6,6
    7,7
    8,8
    9,9

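Both typed edge lists can be generated the same way as in the homogeneous case; this sketch recreates the two CSV files shown above:

```python
import os

import numpy as np

os.makedirs("edges", exist_ok=True)
# user -> user "follow" edges: 0->1, 1->2, ..., 8->9.
follow = np.stack([np.arange(9), np.arange(1, 10)], axis=1)
# user -> item "click" edges: user i clicks item i.
click = np.stack([np.arange(10), np.arange(10)], axis=1)
np.savetxt("edges/user_follow_user.csv", follow, fmt="%d", delimiter=",")
np.savetxt("edges/user_click_item.csv", click, fmt="%d", delimiter=",")
```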
For the feature data, each node type and edge type has a feature named
``feat``. The features are stored in NumPy files with shapes
``[num_nodes, 10]`` and ``[num_edges, 10]`` respectively, as shown below.

.. code:: python

    user_feat.npy
    array([[0., 0., 0., 0., 0., 0., 0., 0., 0., 0.],
           [1., 1., 1., 1., 1., 1., 1., 1., 1., 1.],
           [2., 2., 2., 2., 2., 2., 2., 2., 2., 2.],
           [3., 3., 3., 3., 3., 3., 3., 3., 3., 3.],
           [4., 4., 4., 4., 4., 4., 4., 4., 4., 4.],
           [5., 5., 5., 5., 5., 5., 5., 5., 5., 5.],
           [6., 6., 6., 6., 6., 6., 6., 6., 6., 6.],
           [7., 7., 7., 7., 7., 7., 7., 7., 7., 7.],
           [8., 8., 8., 8., 8., 8., 8., 8., 8., 8.],
           [9., 9., 9., 9., 9., 9., 9., 9., 9., 9.]])
    item_feat.npy
    array([[0., 0., 0., 0., 0., 0., 0., 0., 0., 0.],
           [1., 1., 1., 1., 1., 1., 1., 1., 1., 1.],
           [2., 2., 2., 2., 2., 2., 2., 2., 2., 2.],
           [3., 3., 3., 3., 3., 3., 3., 3., 3., 3.],
           [4., 4., 4., 4., 4., 4., 4., 4., 4., 4.],
           [5., 5., 5., 5., 5., 5., 5., 5., 5., 5.],
           [6., 6., 6., 6., 6., 6., 6., 6., 6., 6.],
           [7., 7., 7., 7., 7., 7., 7., 7., 7., 7.],
           [8., 8., 8., 8., 8., 8., 8., 8., 8., 8.],
           [9., 9., 9., 9., 9., 9., 9., 9., 9., 9.]])
    user_follow_user_feat.npy
    array([[0., 0., 0., 0., 0., 0., 0., 0., 0., 0.],
           [1., 1., 1., 1., 1., 1., 1., 1., 1., 1.],
           [2., 2., 2., 2., 2., 2., 2., 2., 2., 2.],
           [3., 3., 3., 3., 3., 3., 3., 3., 3., 3.],
           [4., 4., 4., 4., 4., 4., 4., 4., 4., 4.],
           [5., 5., 5., 5., 5., 5., 5., 5., 5., 5.],
           [6., 6., 6., 6., 6., 6., 6., 6., 6., 6.],
           [7., 7., 7., 7., 7., 7., 7., 7., 7., 7.],
           [8., 8., 8., 8., 8., 8., 8., 8., 8., 8.],
           [9., 9., 9., 9., 9., 9., 9., 9., 9., 9.]])
    user_click_item_feat.npy
    array([[0., 0., 0., 0., 0., 0., 0., 0., 0., 0.],
           [1., 1., 1., 1., 1., 1., 1., 1., 1., 1.],
           [2., 2., 2., 2., 2., 2., 2., 2., 2., 2.],
           [3., 3., 3., 3., 3., 3., 3., 3., 3., 3.],
           [4., 4., 4., 4., 4., 4., 4., 4., 4., 4.],
           [5., 5., 5., 5., 5., 5., 5., 5., 5., 5.],
           [6., 6., 6., 6., 6., 6., 6., 6., 6., 6.],
           [7., 7., 7., 7., 7., 7., 7., 7., 7., 7.],
           [8., 8., 8., 8., 8., 8., 8., 8., 8., 8.],
           [9., 9., 9., 9., 9., 9., 9., 9., 9., 9.]])

For the ``tasks`` field, we have two tasks: ``node_classification`` and
``link_prediction``. For each task, we have three sets: ``train_set``,
``validation_set`` and ``test_set``.
For the ``node_classification`` task, there are two fields: ``seed_nodes`` and
``labels``. The ``seed_nodes`` field specifies the node IDs used for training
and evaluation, and the ``labels`` field specifies their labels. Both are
stored in NumPy files of shape ``[num_seeds]`` per split, as shown below.

.. code:: python

    user_train_seed_nodes.npy
    array([0, 1, 2, 3, 4, 5])
    user_train_labels.npy
    array([0, 1, 0, 1, 0, 1])
    user_val_seed_nodes.npy
    array([6, 7])
    user_val_labels.npy
    array([0, 1])
    user_test_seed_nodes.npy
    array([8, 9])
    user_test_labels.npy
    array([0, 1])

For the ``link_prediction`` task, there are two fields: ``node_pairs`` and
``negative_dsts``. The ``node_pairs`` field specifies the positive
source-destination pairs, and the ``negative_dsts`` field specifies the
negative destination nodes. They are stored in NumPy files with shapes
``[num_pairs, 2]`` and ``[num_pairs, num_neg_dsts]`` respectively, as shown
below.

.. code:: python

    follow_train_node_pairs.npy
    array([[0, 1],
           [1, 2],
           [2, 3],
           [3, 4],
           [4, 5],
           [5, 6]])
    follow_val_node_pairs.npy
    array([[6, 7],
           [7, 8]])
    follow_val_negative_dsts.npy
    array([[8, 9],
           [8, 9]])
    follow_test_node_pairs.npy
    array([[8, 9],
           [9, 0]])
    follow_test_negative_dsts.npy
    array([[0, 1],
           [0, 1]])

Full YAML specification
-----------------------

This section gives the full YAML specification of the ``metadata.yaml`` file
for ``OnDiskDataset``. The file specifies the dataset information, including
the graph structure, feature data and tasks.

.. code:: yaml

    path: <string>

``dataset_name``
^^^^^^^^^^^^^^^^

The ``dataset_name`` field is used to specify the name of the dataset. It is
user-defined.

``graph``
^^^^^^^^^

The ``graph`` field is used to specify the graph structure. It has two fields:
``nodes`` and ``edges``.

``feature_data``
^^^^^^^^^^^^^^^^

The ``feature_data`` field is used to specify the feature data. It is a list
of ``feature`` objects. Each ``feature`` object has five canonical fields:
``domain``, ``type``, ``name``, ``format`` and ``path``.

``tasks``
^^^^^^^^^

The ``tasks`` field is used to specify the tasks. It is a list of ``task``
objects. Each ``task`` object has at least three fields: ``train_set``,
``validation_set`` and ``test_set``.