"git@developer.sourcefind.cn:change/sglang.git" did not exist on "3f41b48c4079d0adaf17ce7adf308c6a0d947ad0"
Unverified Commit 7094ff4f authored by Rhett Ying's avatar Rhett Ying Committed by GitHub
Browse files

[doc] add spec for OnDiskDataset (#6810)

parent ca675ed4
.. _stochastic_training-ondisk-dataset-specification:

Prepare dataset
===============

YAML specification
==================

**GraphBolt** provides the ``OnDiskDataset`` class to help users organize the
plain data of a graph: its structure, feature data and tasks. ``OnDiskDataset``
is also designed to efficiently handle large graphs and features that do not
fit into memory by storing them on disk.

To create an ``OnDiskDataset`` object, organize all the data, including the
graph structure, feature data and tasks, into a directory. The directory must
contain a ``metadata.yaml`` file that describes the metadata of the dataset.
Then pass the directory path to the ``OnDiskDataset`` constructor to create
the dataset object.

.. code:: python

    from dgl.graphbolt import OnDiskDataset

    dataset = OnDiskDataset('/path/to/dataset')

The returned ``dataset`` object only loads the YAML file; it does not load any
data yet. To load the graph structure, feature data and tasks, call the
``load`` method.

.. code:: python

    dataset.load()

We separate ``OnDiskDataset`` creation from data loading because you may want
to change some fields in the ``metadata.yaml`` file before loading the data.
For example, you may want to point the feature data files to a different
directory. In this case, modify the paths via ``dataset.yaml_data`` directly,
then call the ``load`` method to load the data.

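As a minimal sketch of such a patch, using a plain dictionary as a hypothetical stand-in for ``dataset.yaml_data`` (whose layout mirrors ``metadata.yaml``), the ``/mnt/shared`` target directory is made up for illustration:

```python
# Hypothetical stand-in for dataset.yaml_data: the parsed metadata.yaml.
yaml_data = {
    "feature_data": [
        {"domain": "node", "name": "feat", "path": "data/node_feat.npy"},
        {"domain": "edge", "name": "feat", "path": "data/edge_feat.npy"},
    ]
}
# Point the feature files to a different directory before calling load().
for feat in yaml_data["feature_data"]:
    feat["path"] = feat["path"].replace("data/", "/mnt/shared/data/", 1)
```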
After loading the data, you can access the graph structure, feature data and
tasks through the ``graph``, ``feature`` and ``tasks`` attributes respectively.

.. code:: python

    graph = dataset.graph
    feature = dataset.feature
    tasks = dataset.tasks

The returned ``graph`` is a ``FusedCSCSamplingGraph`` object, which is used
for sampling. The returned ``feature`` is a ``TorchBasedFeatureStore`` object,
which is used for feature lookup. The returned ``tasks`` is a list of
``Task`` objects, which are used for training and evaluation.

The following examples show the data folder structure and ``metadata.yaml``
file for homogeneous graphs and heterogeneous graphs respectively. For the
complete specification, refer to the `Full YAML specification`_ section.

Homogeneous graph
-----------------

Data folder structure:
^^^^^^^^^^^^^^^^^^^^^^

.. code::

    data/
        node_feat.npy
        edge_feat.npy
    edges/
        edges.csv
    set_nc/
        train_seed_nodes.npy
        train_labels.npy
        val_seed_nodes.npy
        val_labels.npy
        test_seed_nodes.npy
        test_labels.npy
    set_lp/
        train_node_pairs.npy
        val_node_pairs.npy
        val_negative_dsts.npy
        test_node_pairs.npy
        test_negative_dsts.npy
    metadata.yaml

``metadata.yaml`` file:
^^^^^^^^^^^^^^^^^^^^^^^

.. code:: yaml

    dataset_name: homogeneous_graph_nc_lp
    graph:
      nodes:
        - num: 10
      edges:
        - format: csv
          path: edges/edges.csv
    feature_data:
      - domain: node
        name: feat
        format: numpy
        in_memory: true
        path: data/node_feat.npy
      - domain: edge
        name: feat
        format: numpy
        in_memory: true
        path: data/edge_feat.npy
    tasks:
      - name: node_classification
        num_classes: 2
        train_set:
          - data:
              - name: seed_nodes
                format: numpy
                in_memory: true
                path: set_nc/train_seed_nodes.npy
              - name: labels
                format: numpy
                in_memory: true
                path: set_nc/train_labels.npy
        validation_set:
          - data:
              - name: seed_nodes
                format: numpy
                in_memory: true
                path: set_nc/val_seed_nodes.npy
              - name: labels
                format: numpy
                in_memory: true
                path: set_nc/val_labels.npy
        test_set:
          - data:
              - name: seed_nodes
                format: numpy
                in_memory: true
                path: set_nc/test_seed_nodes.npy
              - name: labels
                format: numpy
                in_memory: true
                path: set_nc/test_labels.npy
      - name: link_prediction
        num_classes: 2
        train_set:
          - data:
              - name: node_pairs
                format: numpy
                in_memory: true
                path: set_lp/train_node_pairs.npy
        validation_set:
          - data:
              - name: node_pairs
                format: numpy
                in_memory: true
                path: set_lp/val_node_pairs.npy
              - name: negative_dsts
                format: numpy
                in_memory: true
                path: set_lp/val_negative_dsts.npy
        test_set:
          - data:
              - name: node_pairs
                format: numpy
                in_memory: true
                path: set_lp/test_node_pairs.npy
              - name: negative_dsts
                format: numpy
                in_memory: true
                path: set_lp/test_negative_dsts.npy

For the graph structure, the number of nodes is specified by the ``num`` field,
and the edges are stored in a CSV file with ``<src, dst>`` rows, as shown below.

.. code::

    edges.csv
    0,1
    1,2
    2,3
    3,4
    4,5
    5,6
    6,7
    7,8
    8,9

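The edge list above can be generated with NumPy alone; this is a sketch of one way to produce the file, not part of the GraphBolt API:

```python
import numpy as np

# Build the 9 edges (0->1, 1->2, ..., 8->9) as a [num_edges, 2] array.
edges = np.stack([np.arange(9), np.arange(1, 10)], axis=1)
# Write them as "<src>,<dst>" rows, matching the expected CSV format.
np.savetxt("edges.csv", edges, fmt="%d", delimiter=",")
```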
For the feature data, both nodes and edges have a feature named ``feat``. The
features are stored in NumPy files with shapes ``[num_nodes, 10]`` and
``[num_edges, 10]`` respectively, as shown below.

.. code:: python

    node_feat.npy
    array([[0., 0., 0., 0., 0., 0., 0., 0., 0., 0.],
           [1., 1., 1., 1., 1., 1., 1., 1., 1., 1.],
           [2., 2., 2., 2., 2., 2., 2., 2., 2., 2.],
           [3., 3., 3., 3., 3., 3., 3., 3., 3., 3.],
           [4., 4., 4., 4., 4., 4., 4., 4., 4., 4.],
           [5., 5., 5., 5., 5., 5., 5., 5., 5., 5.],
           [6., 6., 6., 6., 6., 6., 6., 6., 6., 6.],
           [7., 7., 7., 7., 7., 7., 7., 7., 7., 7.],
           [8., 8., 8., 8., 8., 8., 8., 8., 8., 8.],
           [9., 9., 9., 9., 9., 9., 9., 9., 9., 9.]])
    edge_feat.npy
    array([[0., 0., 0., 0., 0., 0., 0., 0., 0., 0.],
           [1., 1., 1., 1., 1., 1., 1., 1., 1., 1.],
           [2., 2., 2., 2., 2., 2., 2., 2., 2., 2.],
           [3., 3., 3., 3., 3., 3., 3., 3., 3., 3.],
           [4., 4., 4., 4., 4., 4., 4., 4., 4., 4.],
           [5., 5., 5., 5., 5., 5., 5., 5., 5., 5.],
           [6., 6., 6., 6., 6., 6., 6., 6., 6., 6.],
           [7., 7., 7., 7., 7., 7., 7., 7., 7., 7.],
           [8., 8., 8., 8., 8., 8., 8., 8., 8., 8.],
           [9., 9., 9., 9., 9., 9., 9., 9., 9., 9.]])

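Feature files like these can be produced with ``np.save``; the snippet below is a sketch that recreates the node features shown above (the ``data/`` directory is created first):

```python
import os

import numpy as np

os.makedirs("data", exist_ok=True)
# Row i holds the value i repeated across a 10-dimensional feature vector.
node_feat = np.repeat(np.arange(10, dtype=np.float64)[:, None], 10, axis=1)
np.save("data/node_feat.npy", node_feat)
```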
For the ``tasks`` field, we have two tasks: ``node_classification`` and
``link_prediction``. For each task, we have three sets: ``train_set``,
``validation_set`` and ``test_set``.
For the ``node_classification`` task, there are two fields: ``seed_nodes`` and
``labels``. The ``seed_nodes`` field specifies the node IDs used for training
and evaluation, and the ``labels`` field specifies their labels. Both are
stored in NumPy files of shape ``[num_seeds]`` per split, as shown below.

.. code:: python

    train_seed_nodes.npy
    array([0, 1, 2, 3, 4, 5])
    train_labels.npy
    array([0, 1, 0, 1, 0, 1])
    val_seed_nodes.npy
    array([6, 7])
    val_labels.npy
    array([0, 1])
    test_seed_nodes.npy
    array([8, 9])
    test_labels.npy
    array([0, 1])

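The split files above can be written with a short loop; this is an illustrative sketch, with the split names and values taken from the arrays shown above:

```python
import os

import numpy as np

os.makedirs("set_nc", exist_ok=True)
# (seed node IDs, labels) per split, matching the arrays shown above.
splits = {
    "train": ([0, 1, 2, 3, 4, 5], [0, 1, 0, 1, 0, 1]),
    "val": ([6, 7], [0, 1]),
    "test": ([8, 9], [0, 1]),
}
for split, (seeds, labels) in splits.items():
    np.save(f"set_nc/{split}_seed_nodes.npy", np.array(seeds))
    np.save(f"set_nc/{split}_labels.npy", np.array(labels))
```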
For the ``link_prediction`` task, there are two fields: ``node_pairs`` and
``negative_dsts``. The ``node_pairs`` field specifies the positive
source-destination pairs, and the ``negative_dsts`` field specifies the
negative destination nodes. They are stored in NumPy files with shapes
``[num_pairs, 2]`` and ``[num_pairs, num_neg_dsts]`` respectively, as shown
below.

.. code:: python

    train_node_pairs.npy
    array([[0, 1],
           [1, 2],
           [2, 3],
           [3, 4],
           [4, 5],
           [5, 6]])
    val_node_pairs.npy
    array([[6, 7],
           [7, 8]])
    val_negative_dsts.npy
    array([[8, 9],
           [8, 9]])
    test_node_pairs.npy
    array([[8, 9],
           [9, 0]])
    test_negative_dsts.npy
    array([[0, 1],
           [0, 1]])

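As with the classification sets, these files are plain NumPy arrays on disk; a sketch that recreates them, with values copied from the arrays shown above:

```python
import os

import numpy as np

os.makedirs("set_lp", exist_ok=True)
# [num_pairs, 2] positive pairs and [num_pairs, num_neg_dsts] negatives.
np.save("set_lp/train_node_pairs.npy",
        np.array([[0, 1], [1, 2], [2, 3], [3, 4], [4, 5], [5, 6]]))
np.save("set_lp/val_node_pairs.npy", np.array([[6, 7], [7, 8]]))
np.save("set_lp/val_negative_dsts.npy", np.array([[8, 9], [8, 9]]))
np.save("set_lp/test_node_pairs.npy", np.array([[8, 9], [9, 0]]))
np.save("set_lp/test_negative_dsts.npy", np.array([[0, 1], [0, 1]]))
```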

.. note::

    The values of the ``name`` fields under ``tasks``, such as ``seed_nodes``,
    ``labels``, ``node_pairs`` and ``negative_dsts``, are mandatory. They are
    used to specify the data fields of ``MiniBatch`` for sampling. The values
    of the ``name`` fields under ``feature_data``, such as ``feat``, are
    user-defined.

Heterogeneous graph
-------------------

Data folder structure:
^^^^^^^^^^^^^^^^^^^^^^

.. code::

    data/
        user_feat.npy
        item_feat.npy
        user_follow_user_feat.npy
        user_click_item_feat.npy
    edges/
        user_follow_user.csv
        user_click_item.csv
    set_nc/
        user_train_seed_nodes.npy
        user_train_labels.npy
        user_val_seed_nodes.npy
        user_val_labels.npy
        user_test_seed_nodes.npy
        user_test_labels.npy
    set_lp/
        follow_train_node_pairs.npy
        follow_val_node_pairs.npy
        follow_val_negative_dsts.npy
        follow_test_node_pairs.npy
        follow_test_negative_dsts.npy
    metadata.yaml

``metadata.yaml`` file:
^^^^^^^^^^^^^^^^^^^^^^^

.. code:: yaml

    dataset_name: heterogeneous_graph_nc_lp
    graph:
      nodes:
        - type: user
          num: 10
        - type: item
          num: 10
      edges:
        - type: "user:follow:user"
          format: csv
          path: edges/user_follow_user.csv
        - type: "user:click:item"
          format: csv
          path: edges/user_click_item.csv
    feature_data:
      - domain: node
        type: user
        name: feat
        format: numpy
        in_memory: true
        path: data/user_feat.npy
      - domain: node
        type: item
        name: feat
        format: numpy
        in_memory: true
        path: data/item_feat.npy
      - domain: edge
        type: "user:follow:user"
        name: feat
        format: numpy
        in_memory: true
        path: data/user_follow_user_feat.npy
      - domain: edge
        type: "user:click:item"
        name: feat
        format: numpy
        in_memory: true
        path: data/user_click_item_feat.npy
    tasks:
      - name: node_classification
        num_classes: 2
        train_set:
          - type: user
            data:
              - name: seed_nodes
                format: numpy
                in_memory: true
                path: set_nc/user_train_seed_nodes.npy
              - name: labels
                format: numpy
                in_memory: true
                path: set_nc/user_train_labels.npy
        validation_set:
          - type: user
            data:
              - name: seed_nodes
                format: numpy
                in_memory: true
                path: set_nc/user_val_seed_nodes.npy
              - name: labels
                format: numpy
                in_memory: true
                path: set_nc/user_val_labels.npy
        test_set:
          - type: user
            data:
              - name: seed_nodes
                format: numpy
                in_memory: true
                path: set_nc/user_test_seed_nodes.npy
              - name: labels
                format: numpy
                in_memory: true
                path: set_nc/user_test_labels.npy
      - name: link_prediction
        num_classes: 2
        train_set:
          - type: "user:follow:user"
            data:
              - name: node_pairs
                format: numpy
                in_memory: true
                path: set_lp/follow_train_node_pairs.npy
        validation_set:
          - type: "user:follow:user"
            data:
              - name: node_pairs
                format: numpy
                in_memory: true
                path: set_lp/follow_val_node_pairs.npy
              - name: negative_dsts
                format: numpy
                in_memory: true
                path: set_lp/follow_val_negative_dsts.npy
        test_set:
          - type: "user:follow:user"
            data:
              - name: node_pairs
                format: numpy
                in_memory: true
                path: set_lp/follow_test_node_pairs.npy
              - name: negative_dsts
                format: numpy
                in_memory: true
                path: set_lp/follow_test_negative_dsts.npy

For the graph structure, the above example has two node types, ``user`` and
``item``; the number of nodes of each type is specified by the ``num`` field.
There are two edge types, ``user:follow:user`` and ``user:click:item``, stored
as two-column CSV files as shown below.

.. code::

    user_follow_user.csv
    0,1
    1,2
    2,3
    3,4
    4,5
    5,6
    6,7
    7,8
    8,9
    user_click_item.csv
    0,0
    1,1
    2,2
    3,3
    4,4
    5,5
    6,6
    7,7
    8,8
    9,9

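Both typed edge lists can be generated the same way as in the homogeneous case; this sketch recreates the two CSV files shown above:

```python
import os

import numpy as np

os.makedirs("edges", exist_ok=True)
# user -> user "follow" edges: 0->1, 1->2, ..., 8->9.
follow = np.stack([np.arange(9), np.arange(1, 10)], axis=1)
# user -> item "click" edges: user i clicks item i.
click = np.stack([np.arange(10), np.arange(10)], axis=1)
np.savetxt("edges/user_follow_user.csv", follow, fmt="%d", delimiter=",")
np.savetxt("edges/user_click_item.csv", click, fmt="%d", delimiter=",")
```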
For the feature data, each node type and edge type has a feature named
``feat``. The features are stored in NumPy files with shapes
``[num_nodes, 10]`` and ``[num_edges, 10]`` respectively, as shown below.

.. code:: python

    user_feat.npy
    array([[0., 0., 0., 0., 0., 0., 0., 0., 0., 0.],
           [1., 1., 1., 1., 1., 1., 1., 1., 1., 1.],
           [2., 2., 2., 2., 2., 2., 2., 2., 2., 2.],
           [3., 3., 3., 3., 3., 3., 3., 3., 3., 3.],
           [4., 4., 4., 4., 4., 4., 4., 4., 4., 4.],
           [5., 5., 5., 5., 5., 5., 5., 5., 5., 5.],
           [6., 6., 6., 6., 6., 6., 6., 6., 6., 6.],
           [7., 7., 7., 7., 7., 7., 7., 7., 7., 7.],
           [8., 8., 8., 8., 8., 8., 8., 8., 8., 8.],
           [9., 9., 9., 9., 9., 9., 9., 9., 9., 9.]])
    item_feat.npy
    array([[0., 0., 0., 0., 0., 0., 0., 0., 0., 0.],
           [1., 1., 1., 1., 1., 1., 1., 1., 1., 1.],
           [2., 2., 2., 2., 2., 2., 2., 2., 2., 2.],
           [3., 3., 3., 3., 3., 3., 3., 3., 3., 3.],
           [4., 4., 4., 4., 4., 4., 4., 4., 4., 4.],
           [5., 5., 5., 5., 5., 5., 5., 5., 5., 5.],
           [6., 6., 6., 6., 6., 6., 6., 6., 6., 6.],
           [7., 7., 7., 7., 7., 7., 7., 7., 7., 7.],
           [8., 8., 8., 8., 8., 8., 8., 8., 8., 8.],
           [9., 9., 9., 9., 9., 9., 9., 9., 9., 9.]])
    user_follow_user_feat.npy
    array([[0., 0., 0., 0., 0., 0., 0., 0., 0., 0.],
           [1., 1., 1., 1., 1., 1., 1., 1., 1., 1.],
           [2., 2., 2., 2., 2., 2., 2., 2., 2., 2.],
           [3., 3., 3., 3., 3., 3., 3., 3., 3., 3.],
           [4., 4., 4., 4., 4., 4., 4., 4., 4., 4.],
           [5., 5., 5., 5., 5., 5., 5., 5., 5., 5.],
           [6., 6., 6., 6., 6., 6., 6., 6., 6., 6.],
           [7., 7., 7., 7., 7., 7., 7., 7., 7., 7.],
           [8., 8., 8., 8., 8., 8., 8., 8., 8., 8.],
           [9., 9., 9., 9., 9., 9., 9., 9., 9., 9.]])
    user_click_item_feat.npy
    array([[0., 0., 0., 0., 0., 0., 0., 0., 0., 0.],
           [1., 1., 1., 1., 1., 1., 1., 1., 1., 1.],
           [2., 2., 2., 2., 2., 2., 2., 2., 2., 2.],
           [3., 3., 3., 3., 3., 3., 3., 3., 3., 3.],
           [4., 4., 4., 4., 4., 4., 4., 4., 4., 4.],
           [5., 5., 5., 5., 5., 5., 5., 5., 5., 5.],
           [6., 6., 6., 6., 6., 6., 6., 6., 6., 6.],
           [7., 7., 7., 7., 7., 7., 7., 7., 7., 7.],
           [8., 8., 8., 8., 8., 8., 8., 8., 8., 8.],
           [9., 9., 9., 9., 9., 9., 9., 9., 9., 9.]])

For the ``tasks`` field, we have two tasks: ``node_classification`` and
``link_prediction``. For each task, we have three sets: ``train_set``,
``validation_set`` and ``test_set``.
For the ``node_classification`` task, there are two fields: ``seed_nodes`` and
``labels``. The ``seed_nodes`` field specifies the node IDs used for training
and evaluation, and the ``labels`` field specifies their labels. Both are
stored in NumPy files of shape ``[num_seeds]`` per split, as shown below.

.. code:: python

    user_train_seed_nodes.npy
    array([0, 1, 2, 3, 4, 5])
    user_train_labels.npy
    array([0, 1, 0, 1, 0, 1])
    user_val_seed_nodes.npy
    array([6, 7])
    user_val_labels.npy
    array([0, 1])
    user_test_seed_nodes.npy
    array([8, 9])
    user_test_labels.npy
    array([0, 1])

For the ``link_prediction`` task, there are two fields: ``node_pairs`` and
``negative_dsts``. The ``node_pairs`` field specifies the positive
source-destination pairs, and the ``negative_dsts`` field specifies the
negative destination nodes. They are stored in NumPy files with shapes
``[num_pairs, 2]`` and ``[num_pairs, num_neg_dsts]`` respectively, as shown
below.

.. code:: python

    follow_train_node_pairs.npy
    array([[0, 1],
           [1, 2],
           [2, 3],
           [3, 4],
           [4, 5],
           [5, 6]])
    follow_val_node_pairs.npy
    array([[6, 7],
           [7, 8]])
    follow_val_negative_dsts.npy
    array([[8, 9],
           [8, 9]])
    follow_test_node_pairs.npy
    array([[8, 9],
           [9, 0]])
    follow_test_negative_dsts.npy
    array([[0, 1],
           [0, 1]])

Full YAML specification
-----------------------

This section gives the full YAML specification of the ``metadata.yaml`` file
for ``OnDiskDataset``. The file specifies the dataset information, including
the graph structure, feature data and tasks.

.. code:: yaml

    path: <string>

``dataset_name``
^^^^^^^^^^^^^^^^

The ``dataset_name`` field is used to specify the name of the dataset. It is
user-defined.

``graph``
^^^^^^^^^

The ``graph`` field is used to specify the graph structure. It has two fields:
``nodes`` and ``edges``.

``feature_data``
^^^^^^^^^^^^^^^^

The ``feature_data`` field is used to specify the feature data. It is a list
of ``feature`` objects. Each ``feature`` object has five canonical fields:
``domain``, ``type``, ``name``, ``format`` and ``path``.

``tasks``
^^^^^^^^^

The ``tasks`` field is used to specify the tasks. It is a list of ``task``
objects. Each ``task`` object has at least three fields: ``train_set``,
``validation_set`` and ``test_set``.