.. _stochastic_training-ondisk-dataset-specification:

Prepare dataset
===============

**GraphBolt** provides the ``OnDiskDataset`` class to help users organize plain
data of graph structure, feature data and tasks. ``OnDiskDataset`` is also
designed to efficiently handle large graphs and features that do not fit into
memory by storing them on disk.
To create an ``OnDiskDataset`` object, you need to organize all the data,
including graph structure, feature data and tasks, into a directory. The
directory should contain a ``metadata.yaml`` file that describes the metadata
of the dataset. Then just pass the directory path to the ``OnDiskDataset``
constructor to create the dataset object.
.. code:: python
from dgl.graphbolt import OnDiskDataset
dataset = OnDiskDataset('/path/to/dataset')
The returned ``dataset`` object only loads the YAML file and does not load any
data yet. To load the graph structure, feature data and tasks, you need to call
the ``load`` method.
.. code:: python
dataset.load()
The reason we separate ``OnDiskDataset`` object creation from data loading
is that you may want to change some fields in the ``metadata.yaml`` file before
loading the data. For example, you may want to change the path of the feature
data files to point to a different directory. In this case, you can modify the
parsed metadata via ``dataset.yaml_data`` directly, then call the ``load``
method to load the data.
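A minimal sketch of this workflow is shown below. It assumes ``yaml_data``
exposes the parsed ``metadata.yaml`` as a nested dictionary, and
``node_feat_v2.npy`` is a hypothetical replacement file used only for
illustration.

.. code:: python

    from dgl.graphbolt import OnDiskDataset

    dataset = OnDiskDataset('/path/to/dataset')

    # Assumption: ``yaml_data`` holds the parsed ``metadata.yaml`` as a
    # nested dict, so a feature entry can be re-pointed before loading.
    dataset.yaml_data['feature_data'][0]['path'] = 'data/node_feat_v2.npy'

    # Load graph structure, feature data and tasks with the modified metadata.
    dataset.load()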
After loading the data, you can access the graph structure, feature data and
tasks through the ``graph``, ``feature`` and ``tasks`` attributes respectively.
.. code:: python
graph = dataset.graph
feature = dataset.feature
tasks = dataset.tasks
The returned ``graph`` is a ``FusedCSCSamplingGraph`` object, which will be used
for sampling. The returned ``feature`` is a ``TorchBasedFeatureStore`` object,
which will be used for feature lookup. The returned ``tasks`` is a list of
``Task`` objects, which will be used for training and evaluation.
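A minimal inspection sketch follows; ``tasks`` itself is documented above as a
plain list, while the per-split attributes of each ``Task`` (``train_set`` and
so on) are assumed here to mirror the YAML field names.

.. code:: python

    for task in dataset.tasks:
        print(task)                 # summary of the task
        print(len(task.train_set))  # size of the training set (assumed attribute)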
The following examples show the data folder structure and the ``metadata.yaml``
file for a homogeneous graph and a heterogeneous graph respectively. For the
full YAML specification, please refer to the `Full YAML specification`_ section.
Homogeneous graph
-----------------
Data folder structure:
^^^^^^^^^^^^^^^^^^^^^^
.. code::
data/
node_feat.npy
edge_feat.npy
edges/
edges.csv
set_nc/
train_seed_nodes.npy
train_labels.npy
val_seed_nodes.npy
val_labels.npy
test_seed_nodes.npy
test_labels.npy
set_lp/
train_node_pairs.npy
val_node_pairs.npy
val_negative_dsts.npy
test_node_pairs.npy
test_negative_dsts.npy
metadata.yaml
``metadata.yaml`` file:
^^^^^^^^^^^^^^^^^^^^^^^
.. code:: yaml
dataset_name: homogeneous_graph_nc_lp
graph:
nodes:
- num: 10
edges:
- format: csv
path: edges/edges.csv
feature_data:
- domain: node
name: feat
format: numpy
in_memory: true
path: data/node_feat.npy
- domain: edge
name: feat
format: numpy
in_memory: true
path: data/edge_feat.npy
tasks:
- name: node_classification
num_classes: 2
train_set:
- data:
- name: seed_nodes
format: numpy
in_memory: true
path: set_nc/train_seed_nodes.npy
- name: labels
format: numpy
in_memory: true
path: set_nc/train_labels.npy
validation_set:
- data:
- name: seed_nodes
format: numpy
in_memory: true
path: set_nc/val_seed_nodes.npy
- name: labels
format: numpy
in_memory: true
path: set_nc/val_labels.npy
test_set:
- data:
- name: seed_nodes
format: numpy
in_memory: true
path: set_nc/test_seed_nodes.npy
- name: labels
format: numpy
in_memory: true
path: set_nc/test_labels.npy
- name: link_prediction
num_classes: 2
train_set:
- data:
- name: node_pairs
format: numpy
in_memory: true
path: set_lp/train_node_pairs.npy
validation_set:
- data:
- name: node_pairs
format: numpy
in_memory: true
path: set_lp/val_node_pairs.npy
- name: negative_dsts
format: numpy
in_memory: true
path: set_lp/val_negative_dsts.npy
test_set:
- data:
- name: node_pairs
format: numpy
in_memory: true
path: set_lp/test_node_pairs.npy
- name: negative_dsts
format: numpy
in_memory: true
path: set_lp/test_negative_dsts.npy
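Before constructing the dataset, you can sanity-check the file with any YAML
parser; a minimal sketch using PyYAML:

.. code:: python

    import yaml

    with open('metadata.yaml') as f:
        meta = yaml.safe_load(f)

    # Verify the top-level fields used by ``OnDiskDataset``.
    assert 'dataset_name' in meta and 'graph' in meta and 'tasks' in meta
    print(meta['tasks'][0]['name'])  # node_classification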
For the graph structure, the number of nodes is specified by the ``num`` field,
and the edges are stored in a CSV file of ``src,dst`` pairs, as shown below.
.. code:: csv
edges.csv
0,1
1,2
2,3
3,4
4,5
5,6
6,7
7,8
8,9
For the feature data, we have feature data named ``feat`` for both nodes and
edges. The feature data are stored in numpy files with shapes ``[num_nodes, 10]``
and ``[num_edges, 10]`` respectively, as shown below.
.. code:: python
node_feat.npy
array([[0., 0., 0., 0., 0., 0., 0., 0., 0., 0.],
[1., 1., 1., 1., 1., 1., 1., 1., 1., 1.],
[2., 2., 2., 2., 2., 2., 2., 2., 2., 2.],
[3., 3., 3., 3., 3., 3., 3., 3., 3., 3.],
[4., 4., 4., 4., 4., 4., 4., 4., 4., 4.],
[5., 5., 5., 5., 5., 5., 5., 5., 5., 5.],
[6., 6., 6., 6., 6., 6., 6., 6., 6., 6.],
[7., 7., 7., 7., 7., 7., 7., 7., 7., 7.],
[8., 8., 8., 8., 8., 8., 8., 8., 8., 8.],
[9., 9., 9., 9., 9., 9., 9., 9., 9., 9.]])
edge_feat.npy
array([[0., 0., 0., 0., 0., 0., 0., 0., 0., 0.],
[1., 1., 1., 1., 1., 1., 1., 1., 1., 1.],
[2., 2., 2., 2., 2., 2., 2., 2., 2., 2.],
[3., 3., 3., 3., 3., 3., 3., 3., 3., 3.],
[4., 4., 4., 4., 4., 4., 4., 4., 4., 4.],
[5., 5., 5., 5., 5., 5., 5., 5., 5., 5.],
[6., 6., 6., 6., 6., 6., 6., 6., 6., 6.],
[7., 7., 7., 7., 7., 7., 7., 7., 7., 7.],
[8., 8., 8., 8., 8., 8., 8., 8., 8., 8.],
[9., 9., 9., 9., 9., 9., 9., 9., 9., 9.]])
For the ``tasks`` field, we have two tasks: ``node_classification`` and
``link_prediction``. For each task, we have three sets: ``train_set``,
``validation_set`` and ``test_set``.
For the ``node_classification`` task, we have two fields: ``seed_nodes`` and
``labels``. The ``seed_nodes`` field specifies the node IDs used for training
and evaluation, and the ``labels`` field specifies their labels. Both are
stored in one-dimensional numpy files with one entry per seed node, as shown
below.
.. code:: python
train_seed_nodes.npy
array([0, 1, 2, 3, 4, 5])
train_labels.npy
array([0, 1, 0, 1, 0, 1])
val_seed_nodes.npy
array([6, 7])
val_labels.npy
array([0, 1])
test_seed_nodes.npy
array([8, 9])
test_labels.npy
array([0, 1])
For the ``link_prediction`` task, we have two fields: ``node_pairs`` and
``negative_dsts``. The ``node_pairs`` field specifies the source-destination
node pairs, and the ``negative_dsts`` field specifies the negative destination
nodes for each pair. They are stored in numpy files with shapes
``[num_pairs, 2]`` and ``[num_pairs, num_negative_dsts]`` respectively, as
shown below.
.. code:: python
train_node_pairs.npy
array([[0, 1],
[1, 2],
[2, 3],
[3, 4],
[4, 5],
[5, 6]])
val_node_pairs.npy
array([[6, 7],
[7, 8]])
val_negative_dsts.npy
array([[8, 9],
[8, 9]])
test_node_pairs.npy
array([[8, 9],
[9, 0]])
test_negative_dsts.npy
array([[0, 1],
[0, 1]])
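All of the example files above can be produced with plain NumPy. A minimal
sketch, mirroring the file names and values listed above:

.. code:: python

    import os
    import numpy as np

    for d in ('data', 'edges', 'set_nc', 'set_lp'):
        os.makedirs(d, exist_ok=True)

    # Graph structure: a chain over 10 nodes stored as ``src,dst`` rows.
    np.savetxt('edges/edges.csv',
               np.array([[i, i + 1] for i in range(9)]),
               fmt='%d', delimiter=',')

    # Feature data: row i is a constant vector of value i, as listed above.
    feat = np.repeat(np.arange(10.0)[:, None], 10, axis=1)
    np.save('data/node_feat.npy', feat)
    np.save('data/edge_feat.npy', feat)

    # Node classification splits.
    np.save('set_nc/train_seed_nodes.npy', np.array([0, 1, 2, 3, 4, 5]))
    np.save('set_nc/train_labels.npy', np.array([0, 1, 0, 1, 0, 1]))
    np.save('set_nc/val_seed_nodes.npy', np.array([6, 7]))
    np.save('set_nc/val_labels.npy', np.array([0, 1]))
    np.save('set_nc/test_seed_nodes.npy', np.array([8, 9]))
    np.save('set_nc/test_labels.npy', np.array([0, 1]))

    # Link prediction splits.
    np.save('set_lp/train_node_pairs.npy',
            np.array([[i, i + 1] for i in range(6)]))
    np.save('set_lp/val_node_pairs.npy', np.array([[6, 7], [7, 8]]))
    np.save('set_lp/val_negative_dsts.npy', np.array([[8, 9], [8, 9]]))
    np.save('set_lp/test_node_pairs.npy', np.array([[8, 9], [9, 0]]))
    np.save('set_lp/test_negative_dsts.npy', np.array([[0, 1], [0, 1]]))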
.. note::
    The values of the ``name`` fields in the ``tasks`` field, such as
    ``seed_nodes``, ``labels``, ``node_pairs`` and ``negative_dsts``, are
    mandatory. They are used to specify the data fields of ``MiniBatch`` for
    sampling. The values of the ``name`` fields in ``feature_data``, such as
    ``feat``, are user-defined.
Heterogeneous graph
-------------------
Data folder structure:
^^^^^^^^^^^^^^^^^^^^^^
.. code::
data/
user_feat.npy
item_feat.npy
user_follow_user_feat.npy
user_click_item_feat.npy
edges/
user_follow_user.csv
user_click_item.csv
set_nc/
user_train_seed_nodes.npy
user_train_labels.npy
user_val_seed_nodes.npy
user_val_labels.npy
user_test_seed_nodes.npy
user_test_labels.npy
set_lp/
follow_train_node_pairs.npy
follow_val_node_pairs.npy
follow_val_negative_dsts.npy
follow_test_node_pairs.npy
follow_test_negative_dsts.npy
metadata.yaml
``metadata.yaml`` file:
^^^^^^^^^^^^^^^^^^^^^^^
.. code:: yaml
dataset_name: heterogeneous_graph_nc_lp
graph:
nodes:
- type: user
num: 10
- type: item
num: 10
edges:
- type: "user:follow:user"
format: csv
path: edges/user_follow_user.csv
- type: "user:click:item"
format: csv
path: edges/user_click_item.csv
feature_data:
- domain: node
type: user
name: feat
format: numpy
in_memory: true
path: data/user_feat.npy
- domain: node
type: item
name: feat
format: numpy
in_memory: true
path: data/item_feat.npy
- domain: edge
type: "user:follow:user"
name: feat
format: numpy
in_memory: true
path: data/user_follow_user_feat.npy
- domain: edge
type: "user:click:item"
name: feat
format: numpy
in_memory: true
path: data/user_click_item_feat.npy
    tasks:
      - name: node_classification
        num_classes: 2
        train_set:
          - type: user
            data:
              - name: seed_nodes
                format: numpy
                in_memory: true
                path: set_nc/user_train_seed_nodes.npy
              - name: labels
                format: numpy
                in_memory: true
                path: set_nc/user_train_labels.npy
        validation_set:
          - type: user
            data:
              - name: seed_nodes
                format: numpy
                in_memory: true
                path: set_nc/user_val_seed_nodes.npy
              - name: labels
                format: numpy
                in_memory: true
                path: set_nc/user_val_labels.npy
        test_set:
          - type: user
            data:
              - name: seed_nodes
                format: numpy
                in_memory: true
                path: set_nc/user_test_seed_nodes.npy
              - name: labels
                format: numpy
                in_memory: true
                path: set_nc/user_test_labels.npy
      - name: link_prediction
        num_classes: 2
        train_set:
          - type: "user:follow:user"
            data:
              - name: node_pairs
                format: numpy
                in_memory: true
                path: set_lp/follow_train_node_pairs.npy
        validation_set:
          - type: "user:follow:user"
            data:
              - name: node_pairs
                format: numpy
                in_memory: true
                path: set_lp/follow_val_node_pairs.npy
              - name: negative_dsts
                format: numpy
                in_memory: true
                path: set_lp/follow_val_negative_dsts.npy
        test_set:
          - type: "user:follow:user"
            data:
              - name: node_pairs
                format: numpy
                in_memory: true
                path: set_lp/follow_test_node_pairs.npy
              - name: negative_dsts
                format: numpy
                in_memory: true
                path: set_lp/follow_test_negative_dsts.npy
For the graph structure, we have two types of nodes in the above example:
``user`` and ``item``. The number of nodes of each type is specified by the
``num`` field. We have two types of edges: ``user:follow:user`` and
``user:click:item``. Each edge type is stored in a two-column CSV file of
``src,dst`` pairs, as shown below.
.. code:: csv
user_follow_user.csv
0,1
1,2
2,3
3,4
4,5
5,6
6,7
7,8
8,9
user_click_item.csv
0,0
1,1
2,2
3,3
4,4
5,5
6,6
7,7
8,8
9,9
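These CSV files can likewise be written with NumPy; a minimal sketch mirroring
the listings above:

.. code:: python

    import numpy as np

    # ``user:follow:user`` forms a chain over the 10 user nodes.
    np.savetxt('edges/user_follow_user.csv',
               np.array([[i, i + 1] for i in range(9)]),
               fmt='%d', delimiter=',')

    # ``user:click:item`` pairs user i with item i.
    np.savetxt('edges/user_click_item.csv',
               np.array([[i, i] for i in range(10)]),
               fmt='%d', delimiter=',')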
For the feature data, we have feature data named ``feat`` for each node type
and each edge type. The feature data are stored in numpy files with shapes
``[num_nodes, 10]`` and ``[num_edges, 10]`` respectively, as shown below.
.. code:: python
user_feat.npy
array([[0., 0., 0., 0., 0., 0., 0., 0., 0., 0.],
[1., 1., 1., 1., 1., 1., 1., 1., 1., 1.],
[2., 2., 2., 2., 2., 2., 2., 2., 2., 2.],
[3., 3., 3., 3., 3., 3., 3., 3., 3., 3.],
[4., 4., 4., 4., 4., 4., 4., 4., 4., 4.],
[5., 5., 5., 5., 5., 5., 5., 5., 5., 5.],
[6., 6., 6., 6., 6., 6., 6., 6., 6., 6.],
[7., 7., 7., 7., 7., 7., 7., 7., 7., 7.],
[8., 8., 8., 8., 8., 8., 8., 8., 8., 8.],
[9., 9., 9., 9., 9., 9., 9., 9., 9., 9.]])
item_feat.npy
array([[0., 0., 0., 0., 0., 0., 0., 0., 0., 0.],
[1., 1., 1., 1., 1., 1., 1., 1., 1., 1.],
[2., 2., 2., 2., 2., 2., 2., 2., 2., 2.],
[3., 3., 3., 3., 3., 3., 3., 3., 3., 3.],
[4., 4., 4., 4., 4., 4., 4., 4., 4., 4.],
[5., 5., 5., 5., 5., 5., 5., 5., 5., 5.],
[6., 6., 6., 6., 6., 6., 6., 6., 6., 6.],
[7., 7., 7., 7., 7., 7., 7., 7., 7., 7.],
[8., 8., 8., 8., 8., 8., 8., 8., 8., 8.],
[9., 9., 9., 9., 9., 9., 9., 9., 9., 9.]])
user_follow_user_feat.npy
array([[0., 0., 0., 0., 0., 0., 0., 0., 0., 0.],
[1., 1., 1., 1., 1., 1., 1., 1., 1., 1.],
[2., 2., 2., 2., 2., 2., 2., 2., 2., 2.],
[3., 3., 3., 3., 3., 3., 3., 3., 3., 3.],
[4., 4., 4., 4., 4., 4., 4., 4., 4., 4.],
[5., 5., 5., 5., 5., 5., 5., 5., 5., 5.],
[6., 6., 6., 6., 6., 6., 6., 6., 6., 6.],
[7., 7., 7., 7., 7., 7., 7., 7., 7., 7.],
[8., 8., 8., 8., 8., 8., 8., 8., 8., 8.],
[9., 9., 9., 9., 9., 9., 9., 9., 9., 9.]])
user_click_item_feat.npy
array([[0., 0., 0., 0., 0., 0., 0., 0., 0., 0.],
[1., 1., 1., 1., 1., 1., 1., 1., 1., 1.],
[2., 2., 2., 2., 2., 2., 2., 2., 2., 2.],
[3., 3., 3., 3., 3., 3., 3., 3., 3., 3.],
[4., 4., 4., 4., 4., 4., 4., 4., 4., 4.],
[5., 5., 5., 5., 5., 5., 5., 5., 5., 5.],
[6., 6., 6., 6., 6., 6., 6., 6., 6., 6.],
[7., 7., 7., 7., 7., 7., 7., 7., 7., 7.],
[8., 8., 8., 8., 8., 8., 8., 8., 8., 8.],
[9., 9., 9., 9., 9., 9., 9., 9., 9., 9.]])
For the ``tasks`` field, we have two tasks: ``node_classification`` and
``link_prediction``. For each task, we have three sets: ``train_set``,
``validation_set`` and ``test_set``.
For the ``node_classification`` task, we have two fields: ``seed_nodes`` and
``labels``. The ``seed_nodes`` field specifies the node IDs used for training
and evaluation, and the ``labels`` field specifies their labels. Both are
stored in one-dimensional numpy files with one entry per seed node, as shown
below.
.. code:: python
user_train_seed_nodes.npy
    array([0, 1, 2, 3, 4, 5])
user_train_labels.npy
array([0, 1, 0, 1, 0, 1])
user_val_seed_nodes.npy
array([6, 7])
user_val_labels.npy
array([0, 1])
user_test_seed_nodes.npy
array([8, 9])
user_test_labels.npy
array([0, 1])
For the ``link_prediction`` task, we have two fields: ``node_pairs`` and
``negative_dsts``. The ``node_pairs`` field specifies the source-destination
node pairs, and the ``negative_dsts`` field specifies the negative destination
nodes for each pair. They are stored in numpy files with shapes
``[num_pairs, 2]`` and ``[num_pairs, num_negative_dsts]`` respectively, as
shown below.
.. code:: python
    follow_train_node_pairs.npy
    array([[0, 1],
           [1, 2],
           [2, 3],
           [3, 4],
           [4, 5],
           [5, 6]])
    follow_val_node_pairs.npy
    array([[6, 7],
           [7, 8]])
    follow_val_negative_dsts.npy
    array([[8, 9],
           [8, 9]])
    follow_test_node_pairs.npy
    array([[8, 9],
           [9, 0]])
    follow_test_negative_dsts.npy
    array([[0, 1],
           [0, 1]])
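As in the homogeneous example, these split files can be generated with NumPy;
a minimal sketch for the training splits:

.. code:: python

    import numpy as np

    # ``user`` node classification training split.
    np.save('set_nc/user_train_seed_nodes.npy', np.array([0, 1, 2, 3, 4, 5]))
    np.save('set_nc/user_train_labels.npy', np.array([0, 1, 0, 1, 0, 1]))

    # ``user:follow:user`` link prediction training split.
    np.save('set_lp/follow_train_node_pairs.npy',
            np.array([[i, i + 1] for i in range(6)]))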
Full YAML specification
-----------------------
The full YAML specification of the ``metadata.yaml`` file is shown below.

.. code:: yaml

    dataset_name: <string>
    graph:
      nodes:
        - type: <string>
          num: <int>
      edges:
        - type: <string>
          format: <string>
          path: <string>
    feature_data:
      - domain: <string>
        type: <string>
        name: <string>
        format: <string>
        in_memory: <bool>
        path: <string>
    tasks:
      - name: <string>
        num_classes: <int>
        train_set:
          - type: <string>
            data:
              - name: <string>
                format: <string>
                in_memory: <bool>
                path: <string>
        validation_set:
          - type: <string>
            data:
              - name: <string>
                format: <string>
                in_memory: <bool>
                path: <string>
        test_set:
          - type: <string>
            data:
              - name: <string>
                format: <string>
                in_memory: <bool>
                path: <string>

``dataset_name``
^^^^^^^^^^^^^^^^

The ``dataset_name`` field is used to specify the name of the dataset. It is
user-defined.

``graph``
^^^^^^^^^

The ``graph`` field is used to specify the graph structure. It has two fields:
``nodes`` and ``edges``.

``feature_data``
^^^^^^^^^^^^^^^^

The ``feature_data`` field is used to specify the feature data. It is a list of
``feature`` objects. Each ``feature`` object has five canonical fields:
``domain``, ``type``, ``name``, ``format`` and ``path``. Additional
user-defined fields are kept in the ``Feature.metadata`` object.

``tasks``
^^^^^^^^^

The ``tasks`` field is used to specify the tasks. It is a list of ``task``
objects. Each ``task`` object has at least three fields: ``train_set``,
``validation_set`` and ``test_set``.