Commit 358db43a authored by Rhett Ying, committed by GitHub

[GraphBolt] add notebooks for OnDiskDataset (#6771)

parent 65d83ad7
@@ -9,9 +9,8 @@ GraphBolt.
.. toctree::
:maxdepth: 1
:titlesonly:
neighbor_sampling_overview.nblink
node_classification.nblink
link_prediction.nblink
ondisk-dataset ondisk-dataset.rst
.. _stochastic_training-ondisk-dataset-specification:

Prepare dataset
===============

**GraphBolt** provides the ``OnDiskDataset`` class to help users organize the
plain data of a graph: its structure, feature data and tasks. ``OnDiskDataset``
is also designed to efficiently handle large graphs and features that do not
fit into memory by storing them on disk.

To create an ``OnDiskDataset`` object, organize all the data, including the
graph structure, feature data and tasks, into a directory. The directory should
contain a ``metadata.yaml`` file that describes the metadata of the dataset.
Then pass the directory path to the ``OnDiskDataset`` constructor to create
the dataset object.

.. code:: python

    from dgl.graphbolt import OnDiskDataset

    dataset = OnDiskDataset('/path/to/dataset')

The returned ``dataset`` object only loads the YAML file; it does not load any
data. To load the graph structure, feature data and tasks, call the ``load``
method.

.. code:: python

    dataset.load()

Object creation and data loading are separate steps because you may want to
change some fields of the ``metadata.yaml`` file before loading the data. For
example, you may want the paths of the feature data files to point to a
different directory. In that case, modify the paths directly via
``dataset.yaml_data`` and then call the ``load`` method to load the data.
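
The kind of edit described above can be sketched on a plain dictionary. This
is only an illustration: ``yaml_data`` below is a hand-written stand-in that
mirrors the ``feature_data`` section of ``metadata.yaml``, and the helper
``redirect_feature_paths`` and the directory name ``data_v2`` are hypothetical.
With a real dataset, a similar edit would be applied to ``dataset.yaml_data``
before calling ``load``, assuming it exposes the parsed YAML as nested
dictionaries and lists.

.. code:: python

    import posixpath

    # Hypothetical stand-in for ``dataset.yaml_data`` (an assumption, not the
    # real object): it mirrors the ``feature_data`` section of metadata.yaml.
    yaml_data = {
        "feature_data": [
            {"domain": "node", "name": "feat", "format": "numpy",
             "path": "data/node_feat.npy"},
            {"domain": "edge", "name": "feat", "format": "numpy",
             "path": "data/edge_feat.npy"},
        ]
    }


    def redirect_feature_paths(metadata, new_dir):
        """Point every feature file at ``new_dir``, keeping the file name."""
        for feature in metadata["feature_data"]:
            file_name = posixpath.basename(feature["path"])
            feature["path"] = posixpath.join(new_dir, file_name)
        return metadata


    redirect_feature_paths(yaml_data, "data_v2")
    print([feature["path"] for feature in yaml_data["feature_data"]])
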

After loading the data, you can access the graph structure, feature data and
tasks through the ``graph``, ``feature`` and ``tasks`` attributes respectively.

.. code:: python

    graph = dataset.graph
    feature = dataset.feature
    tasks = dataset.tasks

The returned ``graph`` is a ``FusedCSCSamplingGraph`` object, which will be used
for sampling. The returned ``feature`` is a ``TorchBasedFeatureStore`` object,
which will be used for feature lookup. The returned ``tasks`` is a list of
``Task`` objects, which will be used for training and evaluation.

The following examples show data folder structure and ``metadata.yaml`` file for
homogeneous graphs and heterogeneous graphs respectively. If you want to know
the full YAML specification, please refer to the `Full YAML specification`_ section.

Homogeneous graph
-----------------

Data folder structure:
^^^^^^^^^^^^^^^^^^^^^^

.. code::

    data/
      node_feat.npy
      edge_feat.npy
    edges/
      edges.csv
    set_nc/
      train_seed_nodes.npy
      train_labels.npy
      val_seed_nodes.npy
      val_labels.npy
      test_seed_nodes.npy
      test_labels.npy
    set_lp/
      train_node_pairs.npy
      val_node_pairs.npy
      val_negative_dsts.npy
      test_node_pairs.npy
      test_negative_dsts.npy
    metadata.yaml
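
As a minimal sketch, the layout above could be scaffolded with the standard
library before the real arrays are written. The list ``LAYOUT`` and the helper
``create_layout`` are illustrative names, and the files created here are empty
placeholders, not valid ``.npy`` or CSV content.

.. code:: python

    import tempfile
    from pathlib import Path

    # Relative file paths copied from the folder structure above.
    LAYOUT = [
        "data/node_feat.npy",
        "data/edge_feat.npy",
        "edges/edges.csv",
        "set_nc/train_seed_nodes.npy",
        "set_nc/train_labels.npy",
        "set_nc/val_seed_nodes.npy",
        "set_nc/val_labels.npy",
        "set_nc/test_seed_nodes.npy",
        "set_nc/test_labels.npy",
        "set_lp/train_node_pairs.npy",
        "set_lp/val_node_pairs.npy",
        "set_lp/val_negative_dsts.npy",
        "set_lp/test_node_pairs.npy",
        "set_lp/test_negative_dsts.npy",
        "metadata.yaml",
    ]


    def create_layout(root):
        """Create an empty placeholder file for every entry in LAYOUT."""
        root = Path(root)
        for relative in LAYOUT:
            target = root / relative
            target.parent.mkdir(parents=True, exist_ok=True)
            target.touch()
        # Report what now exists, as POSIX-style paths relative to the root.
        return sorted(
            path.relative_to(root).as_posix()
            for path in root.rglob("*")
            if path.is_file()
        )


    with tempfile.TemporaryDirectory() as tmp:
        created = create_layout(tmp)
    print(len(created))
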

``metadata.yaml`` file:
^^^^^^^^^^^^^^^^^^^^^^^

.. code:: yaml

    dataset_name: homogeneous_graph_nc_lp
    graph:
      nodes:
        - num: 10
      edges:
        - format: csv
          path: edges/edges.csv
    feature_data:
      - domain: node
        name: feat
        format: numpy
        in_memory: true
        path: data/node_feat.npy
      - domain: edge
        name: feat
        format: numpy
        in_memory: true
        path: data/edge_feat.npy
    tasks:
      - name: node_classification
        num_classes: 2
        train_set:
          - data:
              - name: seed_nodes
                format: numpy
                in_memory: true
                path: set_nc/train_seed_nodes.npy
              - name: labels
                format: numpy
                in_memory: true
                path: set_nc/train_labels.npy
        validation_set:
          - data:
              - name: seed_nodes
                format: numpy
                in_memory: true
                path: set_nc/val_seed_nodes.npy
              - name: labels
                format: numpy
                in_memory: true
                path: set_nc/val_labels.npy
        test_set:
          - data:
              - name: seed_nodes
                format: numpy
                in_memory: true
                path: set_nc/test_seed_nodes.npy
              - name: labels
                format: numpy
                in_memory: true
                path: set_nc/test_labels.npy
      - name: link_prediction
        num_classes: 2
        train_set:
          - data:
              - name: node_pairs
                format: numpy
                in_memory: true
                path: set_lp/train_node_pairs.npy
        validation_set:
          - data:
              - name: node_pairs
                format: numpy
                in_memory: true
                path: set_lp/val_node_pairs.npy
              - name: negative_dsts
                format: numpy
                in_memory: true
                path: set_lp/val_negative_dsts.npy
        test_set:
          - data:
              - name: node_pairs
                format: numpy
                in_memory: true
                path: set_lp/test_node_pairs.npy
              - name: negative_dsts
                format: numpy
                in_memory: true
                path: set_lp/test_negative_dsts.npy

For the graph structure, the number of nodes is specified by the ``num`` field,
and the edges are stored in a CSV file with one ``<src, dst>`` pair per row, as
shown below.

.. code:: csv

    edges.csv

    0,1
    1,2
    2,3
    3,4
    4,5
    5,6
    6,7
    7,8
    8,9
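
A file in this shape could be produced with Python's ``csv`` module, as in the
sketch below. The helper ``write_edges`` is an illustrative name, and the
in-memory buffer stands in for a real ``edges/edges.csv`` file opened for
writing.

.. code:: python

    import csv
    import io

    num_nodes = 10
    # Chain graph from the example: node i points to node i + 1.
    edges = [(src, src + 1) for src in range(num_nodes - 1)]


    def write_edges(edge_list, handle):
        """Write one ``src,dst`` pair per row, without a header line."""
        writer = csv.writer(handle, lineterminator="\n")
        writer.writerows(edge_list)


    # The buffer stands in for open("edges/edges.csv", "w", newline="").
    buffer = io.StringIO()
    write_edges(edges, buffer)
    csv_text = buffer.getvalue()
    print(csv_text)

    # Read the rows back and check that every endpoint is a valid node ID.
    parsed = [tuple(map(int, row)) for row in csv.reader(io.StringIO(csv_text))]
    assert all(0 <= node < num_nodes for pair in parsed for node in pair)
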

For the feature data, we have feature data named ``feat`` for both nodes and
edges. The feature data are stored in numpy files with shapes
``[num_nodes, 10]`` and ``[num_edges, 10]`` respectively, as shown below.

.. code:: python

    node_feat.npy

    array([[0., 0., 0., 0., 0., 0., 0., 0., 0., 0.],
           [1., 1., 1., 1., 1., 1., 1., 1., 1., 1.],
           [2., 2., 2., 2., 2., 2., 2., 2., 2., 2.],
           [3., 3., 3., 3., 3., 3., 3., 3., 3., 3.],
           [4., 4., 4., 4., 4., 4., 4., 4., 4., 4.],
           [5., 5., 5., 5., 5., 5., 5., 5., 5., 5.],
           [6., 6., 6., 6., 6., 6., 6., 6., 6., 6.],
           [7., 7., 7., 7., 7., 7., 7., 7., 7., 7.],
           [8., 8., 8., 8., 8., 8., 8., 8., 8., 8.],
           [9., 9., 9., 9., 9., 9., 9., 9., 9., 9.]])

    edge_feat.npy

    array([[0., 0., 0., 0., 0., 0., 0., 0., 0., 0.],
           [1., 1., 1., 1., 1., 1., 1., 1., 1., 1.],
           [2., 2., 2., 2., 2., 2., 2., 2., 2., 2.],
           [3., 3., 3., 3., 3., 3., 3., 3., 3., 3.],
           [4., 4., 4., 4., 4., 4., 4., 4., 4., 4.],
           [5., 5., 5., 5., 5., 5., 5., 5., 5., 5.],
           [6., 6., 6., 6., 6., 6., 6., 6., 6., 6.],
           [7., 7., 7., 7., 7., 7., 7., 7., 7., 7.],
           [8., 8., 8., 8., 8., 8., 8., 8., 8., 8.],
           [9., 9., 9., 9., 9., 9., 9., 9., 9., 9.]])

For the ``tasks`` field, we have two tasks: ``node_classification`` and
``link_prediction``. For each task, we have three sets: ``train_set``,
``validation_set`` and ``test_set``.

For the ``node_classification`` task, we have two fields: ``seed_nodes`` and
``labels``. The ``seed_nodes`` field specifies the node IDs used for training
and evaluation. The ``labels`` field specifies the label of each of those
nodes. Both are stored in numpy files with shape ``[num_nodes]``, as shown
below.

.. code:: python

    train_seed_nodes.npy

    array([0, 1, 2, 3, 4, 5])

    train_labels.npy

    array([0, 1, 0, 1, 0, 1])

    val_seed_nodes.npy

    array([6, 7])

    val_labels.npy

    array([0, 1])

    test_seed_nodes.npy

    array([8, 9])

    test_labels.npy

    array([0, 1])
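
Before saving such arrays, it can help to check that each split is internally
consistent. The sketch below does this on plain Python lists holding the same
values as the example files. ``check_node_split`` is a hypothetical helper and
not part of GraphBolt. Treating overlapping splits as a problem is also an
assumption of this sketch, not a requirement stated above.

.. code:: python

    def check_node_split(seed_nodes, labels, num_nodes, num_classes):
        """Return a list of problems found in one split (empty if none)."""
        problems = []
        if len(seed_nodes) != len(labels):
            problems.append("seed_nodes and labels differ in length")
        if any(not 0 <= node < num_nodes for node in seed_nodes):
            problems.append("node ID out of range")
        if any(not 0 <= label < num_classes for label in labels):
            problems.append("label out of range")
        return problems


    # Same values as the example ``.npy`` files above.
    splits = {
        "train": ([0, 1, 2, 3, 4, 5], [0, 1, 0, 1, 0, 1]),
        "val": ([6, 7], [0, 1]),
        "test": ([8, 9], [0, 1]),
    }

    report = {
        name: check_node_split(nodes, labels, num_nodes=10, num_classes=2)
        for name, (nodes, labels) in splits.items()
    }

    # Assumption of this sketch: the three splits should not share nodes.
    all_nodes = [node for nodes, _ in splits.values() for node in nodes]
    disjoint = len(all_nodes) == len(set(all_nodes))
    print(report, disjoint)
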

For the ``link_prediction`` task, we have two fields: ``node_pairs`` and
``negative_dsts``. The ``node_pairs`` field specifies the node pairs. The
``negative_dsts`` field specifies the negative destination nodes. They are
stored in numpy files with shapes ``[num_edges, 2]`` and
``[num_edges, num_neg_dsts]`` respectively, as shown below.

.. code:: python

    train_node_pairs.npy

    array([[0, 1],
           [1, 2],
           [2, 3],
           [3, 4],
           [4, 5],
           [5, 6]])

    val_node_pairs.npy

    array([[6, 7],
           [7, 8]])

    val_negative_dsts.npy

    array([[8, 9],
           [8, 9]])

    test_node_pairs.npy

    array([[8, 9],
           [9, 0]])

    test_negative_dsts.npy

    array([[0, 1],
           [0, 1]])
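
The shape relationship between ``node_pairs`` and ``negative_dsts`` can be
checked with a few lines of plain Python, as sketched below on nested lists
holding the validation values from the example. ``shape_of`` and
``check_link_split`` are hypothetical helpers, not GraphBolt functions.

.. code:: python

    def shape_of(rows):
        """Return (num_rows, num_cols) of a rectangular list of lists."""
        widths = {len(row) for row in rows}
        if len(widths) != 1:
            raise ValueError("rows are empty or have different lengths")
        return len(rows), widths.pop()


    def check_link_split(node_pairs, negative_dsts=None):
        """Validate one split and return (num_edges, num_neg_dsts)."""
        num_edges, width = shape_of(node_pairs)
        if width != 2:
            raise ValueError("node_pairs must have shape [num_edges, 2]")
        if negative_dsts is None:
            # E.g. the training set in the example has no negative_dsts.
            return num_edges, 0
        num_rows, num_neg_dsts = shape_of(negative_dsts)
        if num_rows != num_edges:
            raise ValueError("negative_dsts needs one row per node pair")
        return num_edges, num_neg_dsts


    val_node_pairs = [[6, 7], [7, 8]]
    val_negative_dsts = [[8, 9], [8, 9]]
    summary = check_link_split(val_node_pairs, val_negative_dsts)
    print(summary)
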

.. note::
    The values of the ``name`` fields in ``tasks``, such as ``seed_nodes``,
    ``labels``, ``node_pairs`` and ``negative_dsts``, are mandatory. They are
    used to specify the data fields of ``MiniBatch`` for sampling. The values
    of the ``name`` fields in ``feature_data``, such as ``feat``, are
    user-defined.

Heterogeneous graph
-------------------

Data folder structure:
^^^^^^^^^^^^^^^^^^^^^^

.. code::

    data/
      user_feat.npy
      item_feat.npy
      user_follow_user_feat.npy
      user_click_item_feat.npy
    edges/
      user_follow_user.csv
      user_click_item.csv
    set_nc/
      user_train_seed_nodes.npy
      user_train_labels.npy
      user_val_seed_nodes.npy
      user_val_labels.npy
      user_test_seed_nodes.npy
      user_test_labels.npy
    set_lp/
      follow_train_node_pairs.npy
      follow_val_node_pairs.npy
      follow_val_negative_dsts.npy
      follow_test_node_pairs.npy
      follow_test_negative_dsts.npy
    metadata.yaml

``metadata.yaml`` file:
^^^^^^^^^^^^^^^^^^^^^^^

.. code:: yaml

    dataset_name: heterogeneous_graph_nc_lp
    graph:
      nodes:
        - type: user
          num: 10
        - type: item
          num: 10
      edges:
        - type: "user:follow:user"
          format: csv
          path: edges/user_follow_user.csv
        - type: "user:click:item"
          format: csv
          path: edges/user_click_item.csv
    feature_data:
      - domain: node
        type: user
        name: feat
        format: numpy
        in_memory: true
        path: data/user_feat.npy
      - domain: node
        type: item
        name: feat
        format: numpy
        in_memory: true
        path: data/item_feat.npy
      - domain: edge
        type: "user:follow:user"
        name: feat
        format: numpy
        in_memory: true
        path: data/user_follow_user_feat.npy
      - domain: edge
        type: "user:click:item"
        name: feat
        format: numpy
        in_memory: true
        path: data/user_click_item_feat.npy
    tasks:
      - name: node_classification
        num_classes: 2
        train_set:
          - type: user
            data:
              - name: seed_nodes
                format: numpy
                in_memory: true
                path: set_nc/user_train_seed_nodes.npy
              - name: labels
                format: numpy
                in_memory: true
                path: set_nc/user_train_labels.npy
        validation_set:
          - type: user
            data:
              - name: seed_nodes
                format: numpy
                in_memory: true
                path: set_nc/user_val_seed_nodes.npy
              - name: labels
                format: numpy
                in_memory: true
                path: set_nc/user_val_labels.npy
        test_set:
          - type: user
            data:
              - name: seed_nodes
                format: numpy
                in_memory: true
                path: set_nc/user_test_seed_nodes.npy
              - name: labels
                format: numpy
                in_memory: true
                path: set_nc/user_test_labels.npy
      - name: link_prediction
        num_classes: 2
        train_set:
          - type: "user:follow:user"
            data:
              - name: node_pairs
                format: numpy
                in_memory: true
                path: set_lp/follow_train_node_pairs.npy
        validation_set:
          - type: "user:follow:user"
            data:
              - name: node_pairs
                format: numpy
                in_memory: true
                path: set_lp/follow_val_node_pairs.npy
              - name: negative_dsts
                format: numpy
                in_memory: true
                path: set_lp/follow_val_negative_dsts.npy
        test_set:
          - type: "user:follow:user"
            data:
              - name: node_pairs
                format: numpy
                in_memory: true
                path: set_lp/follow_test_node_pairs.npy
              - name: negative_dsts
                format: numpy
                in_memory: true
                path: set_lp/follow_test_negative_dsts.npy

For the graph structure, the example above has two types of nodes: ``user`` and
``item``. The number of nodes of each type is specified by the ``num`` field.
There are two types of edges: ``user:follow:user`` and ``user:click:item``.
The edges of each type are stored as two columns in a CSV file, as shown below.

.. code:: csv

    user_follow_user.csv

    0,1
    1,2
    2,3
    3,4
    4,5
    5,6
    6,7
    7,8
    8,9

    user_click_item.csv

    0,0
    1,1
    2,2
    3,3
    4,4
    5,5
    6,6
    7,7
    8,8
    9,9
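
The edge type strings can be used to check each CSV file against the per-type
node counts. The sketch below assumes that an edge type reads as
``source type:relation:destination type`` and that node IDs are numbered from
zero within each node type. ``parse_edge_type`` and ``check_edges`` are
hypothetical helpers, not GraphBolt functions.

.. code:: python

    def parse_edge_type(edge_type):
        """Split ``"src:relation:dst"`` into its three parts."""
        src_type, relation, dst_type = edge_type.split(":")
        return src_type, relation, dst_type


    def check_edges(edge_type, edges, num_nodes):
        """Return True if every endpoint is a valid ID for its node type."""
        src_type, _, dst_type = parse_edge_type(edge_type)
        return all(
            0 <= src < num_nodes[src_type] and 0 <= dst < num_nodes[dst_type]
            for src, dst in edges
        )


    # Node counts and edge lists taken from the example above.
    num_nodes = {"user": 10, "item": 10}
    follow = [(i, i + 1) for i in range(9)]   # rows of user_follow_user.csv
    click = [(i, i) for i in range(10)]       # rows of user_click_item.csv

    print(check_edges("user:follow:user", follow, num_nodes))
    print(check_edges("user:click:item", click, num_nodes))
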

For the feature data, we have feature data named ``feat`` for both nodes and
edges. The feature data are stored in numpy files with shapes
``[num_nodes, 10]`` and ``[num_edges, 10]`` respectively, as shown below.

.. code:: python

    user_feat.npy

    array([[0., 0., 0., 0., 0., 0., 0., 0., 0., 0.],
           [1., 1., 1., 1., 1., 1., 1., 1., 1., 1.],
           [2., 2., 2., 2., 2., 2., 2., 2., 2., 2.],
           [3., 3., 3., 3., 3., 3., 3., 3., 3., 3.],
           [4., 4., 4., 4., 4., 4., 4., 4., 4., 4.],
           [5., 5., 5., 5., 5., 5., 5., 5., 5., 5.],
           [6., 6., 6., 6., 6., 6., 6., 6., 6., 6.],
           [7., 7., 7., 7., 7., 7., 7., 7., 7., 7.],
           [8., 8., 8., 8., 8., 8., 8., 8., 8., 8.],
           [9., 9., 9., 9., 9., 9., 9., 9., 9., 9.]])

    item_feat.npy

    array([[0., 0., 0., 0., 0., 0., 0., 0., 0., 0.],
           [1., 1., 1., 1., 1., 1., 1., 1., 1., 1.],
           [2., 2., 2., 2., 2., 2., 2., 2., 2., 2.],
           [3., 3., 3., 3., 3., 3., 3., 3., 3., 3.],
           [4., 4., 4., 4., 4., 4., 4., 4., 4., 4.],
           [5., 5., 5., 5., 5., 5., 5., 5., 5., 5.],
           [6., 6., 6., 6., 6., 6., 6., 6., 6., 6.],
           [7., 7., 7., 7., 7., 7., 7., 7., 7., 7.],
           [8., 8., 8., 8., 8., 8., 8., 8., 8., 8.],
           [9., 9., 9., 9., 9., 9., 9., 9., 9., 9.]])

    user_follow_user_feat.npy

    array([[0., 0., 0., 0., 0., 0., 0., 0., 0., 0.],
           [1., 1., 1., 1., 1., 1., 1., 1., 1., 1.],
           [2., 2., 2., 2., 2., 2., 2., 2., 2., 2.],
           [3., 3., 3., 3., 3., 3., 3., 3., 3., 3.],
           [4., 4., 4., 4., 4., 4., 4., 4., 4., 4.],
           [5., 5., 5., 5., 5., 5., 5., 5., 5., 5.],
           [6., 6., 6., 6., 6., 6., 6., 6., 6., 6.],
           [7., 7., 7., 7., 7., 7., 7., 7., 7., 7.],
           [8., 8., 8., 8., 8., 8., 8., 8., 8., 8.],
           [9., 9., 9., 9., 9., 9., 9., 9., 9., 9.]])

    user_click_item_feat.npy

    array([[0., 0., 0., 0., 0., 0., 0., 0., 0., 0.],
           [1., 1., 1., 1., 1., 1., 1., 1., 1., 1.],
           [2., 2., 2., 2., 2., 2., 2., 2., 2., 2.],
           [3., 3., 3., 3., 3., 3., 3., 3., 3., 3.],
           [4., 4., 4., 4., 4., 4., 4., 4., 4., 4.],
           [5., 5., 5., 5., 5., 5., 5., 5., 5., 5.],
           [6., 6., 6., 6., 6., 6., 6., 6., 6., 6.],
           [7., 7., 7., 7., 7., 7., 7., 7., 7., 7.],
           [8., 8., 8., 8., 8., 8., 8., 8., 8., 8.],
           [9., 9., 9., 9., 9., 9., 9., 9., 9., 9.]])

For the ``tasks`` field, we have two tasks: ``node_classification`` and
``link_prediction``. For each task, we have three sets: ``train_set``,
``validation_set`` and ``test_set``.

For the ``node_classification`` task, we have two fields: ``seed_nodes`` and
``labels``. The ``seed_nodes`` field specifies the node IDs used for training
and evaluation. The ``labels`` field specifies the label of each of those
nodes. Both are stored in numpy files with shape ``[num_nodes]``, as shown
below.

.. code:: python

    user_train_seed_nodes.npy

    array([0, 1, 2, 3, 4, 5])

    user_train_labels.npy

    array([0, 1, 0, 1, 0, 1])

    user_val_seed_nodes.npy

    array([6, 7])

    user_val_labels.npy

    array([0, 1])

    user_test_seed_nodes.npy

    array([8, 9])

    user_test_labels.npy

    array([0, 1])

For the ``link_prediction`` task, we have two fields: ``node_pairs`` and
``negative_dsts``. The ``node_pairs`` field specifies the node pairs. The
``negative_dsts`` field specifies the negative destination nodes. They are
stored in numpy files with shapes ``[num_edges, 2]`` and
``[num_edges, num_neg_dsts]`` respectively, as shown below.

.. code:: python

    follow_train_node_pairs.npy

    array([[0, 1],
           [1, 2],
           [2, 3],
           [3, 4],
           [4, 5],
           [5, 6]])

    follow_val_node_pairs.npy

    array([[6, 7],
           [7, 8]])

    follow_val_negative_dsts.npy

    array([[8, 9],
           [8, 9]])

    follow_test_node_pairs.npy

    array([[8, 9],
           [9, 0]])

    follow_test_negative_dsts.npy

    array([[0, 1],
           [0, 1]])

Full YAML specification
-----------------------

The full YAML specification of ``metadata.yaml`` file is shown below.

.. code:: yaml

    dataset_name: <string>
    graph:
      nodes:
        - type: <string>
          num: <int>
        - type: <string>
          num: <int>
      edges:
        - type: <string>
          format: <string>
          path: <string>
        - type: <string>
          format: <string>
          path: <string>
    feature_data:
      - domain: node
        type: <string>
        name: <string>
        format: <string>
        in_memory: <bool>
        path: <string>
      - domain: node
        type: <string>
        name: <string>
        format: <string>
        in_memory: <bool>
        path: <string>
      - domain: edge
        type: <string>
        name: <string>
        format: <string>
        in_memory: <bool>
        path: <string>
      - domain: edge
        type: <string>
        name: <string>
        format: <string>
        in_memory: <bool>
        path: <string>
    tasks:
      - name: <string>
        num_classes: <int>
        train_set:
          - type: <string>
            data:
              - name: <string>
                format: <string>
                in_memory: <bool>
                path: <string>
              - name: <string>
                format: <string>
                in_memory: <bool>
                path: <string>
        validation_set:
          - type: <string>
            data:
              - name: <string>
                format: <string>
                in_memory: <bool>
                path: <string>
              - name: <string>
                format: <string>
                in_memory: <bool>
                path: <string>
        test_set:
          - type: <string>
            data:
              - name: <string>
                format: <string>
                in_memory: <bool>
                path: <string>
              - name: <string>
                format: <string>
                in_memory: <bool>
                path: <string>

``dataset_name``
^^^^^^^^^^^^^^^^

The ``dataset_name`` field is used to specify the name of the dataset. It is
user-defined.

``graph``
^^^^^^^^^

The ``graph`` field is used to specify the graph structure. It has two fields:
``nodes`` and ``edges``.

- ``nodes``: ``list``

  The ``nodes`` field is used to specify the number of nodes for each node type.
  It is a list of ``node`` objects. Each ``node`` object has two fields: ``type``
  and ``num``.

  - ``type``: ``string``, optional

    The ``type`` field is used to specify the node type. It is ``null`` for
    homogeneous graphs. For heterogeneous graphs, it is the node type.

  - ``num``: ``int``

    The ``num`` field is used to specify the number of nodes for the node type.
    It is mandatory for both homogeneous graphs and heterogeneous graphs.

- ``edges``: ``list``

  The ``edges`` field is used to specify the edges. It is a list of ``edge``
  objects. Each ``edge`` object has three fields: ``type``, ``format`` and
  ``path``.

  - ``type``: ``string``, optional

    The ``type`` field is used to specify the edge type. It is ``null`` for
    homogeneous graphs. For heterogeneous graphs, it is the edge type.

  - ``format``: ``string``

    The ``format`` field is used to specify the format of the edge data. It can
    only be ``csv`` for now.

  - ``path``: ``string``

    The ``path`` field is used to specify the path of the edge data. It is
    relative to the directory of ``metadata.yaml`` file.

``feature_data``
^^^^^^^^^^^^^^^^

The ``feature_data`` field is used to specify the feature data. It is a list of
``feature`` objects. Each ``feature`` object has five canonical fields: ``domain``,
``type``, ``name``, ``format`` and ``path``. Any other fields will be passed to
the ``Feature.metadata`` object.

- ``domain``: ``string``

  The ``domain`` field is used to specify the domain of the feature data. It can
  be either ``node`` or ``edge``.

- ``type``: ``string``, optional

  The ``type`` field is used to specify the type of the feature data. It is
  ``null`` for homogeneous graphs. For heterogeneous graphs, it is the node or
  edge type.

- ``name``: ``string``

  The ``name`` field is used to specify the name of the feature data. It is
  user-defined.

- ``format``: ``string``

  The ``format`` field is used to specify the format of the feature data. It can
  be either ``numpy`` or ``torch``.

- ``in_memory``: ``bool``, optional

  The ``in_memory`` field is used to specify whether the feature data is loaded
  into memory. It can be either ``true`` or ``false``. Default is ``true``.

- ``path``: ``string``

  The ``path`` field is used to specify the path of the feature data. It is
  relative to the directory of ``metadata.yaml`` file.

``tasks``
^^^^^^^^^

The ``tasks`` field is used to specify the tasks. It is a list of ``task``
objects. Each ``task`` object has at least three fields: ``train_set``,
``validation_set`` and ``test_set``. You are free to add other fields such as
``num_classes``; all of these fields will be passed to the ``Task.metadata``
object.

- ``name``: ``string``, optional

  The ``name`` field is used to specify the name of the task. It is user-defined.

- ``num_classes``: ``int``, optional

  The ``num_classes`` field is used to specify the number of classes of the task.

- ``train_set``: ``list``

  The ``train_set`` field is used to specify the training set. It is a list of
  ``set`` objects. Each ``set`` object has two fields: ``type`` and ``data``.

  - ``type``: ``string``, optional

    The ``type`` field is used to specify the node/edge type of the set. It is
    ``null`` for homogeneous graphs. For heterogeneous graphs, it is the node
    or edge type.

  - ``data``: ``list``

    The ``data`` field is used to load ``train_set``. It is a list of ``data``
    objects. Each ``data`` object has four fields: ``name``, ``format``,
    ``in_memory`` and ``path``.

    - ``name``: ``string``

      The ``name`` field is used to specify the name of the data. It is mandatory
      and is used to specify the data fields of ``MiniBatch`` for sampling. It can
      be one of ``seed_nodes``, ``labels``, ``node_pairs``, ``negative_srcs`` or
      ``negative_dsts``. If any other name is used, it will be added into the
      ``MiniBatch`` data fields.

    - ``format``: ``string``

      The ``format`` field is used to specify the format of the data. It can be
      either ``numpy`` or ``torch``.

    - ``in_memory``: ``bool``, optional

      The ``in_memory`` field is used to specify whether the data is loaded into
      memory. It can be either ``true`` or ``false``. Default is ``true``.

    - ``path``: ``string``

      The ``path`` field is used to specify the path of the data. It is relative
      to the directory of ``metadata.yaml`` file.

- ``validation_set``: ``list``
- ``test_set``: ``list``

  The ``validation_set`` and ``test_set`` fields are used to specify the
  validation set and test set respectively. They are similar to the
  ``train_set`` field.
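
Because every ``path`` is relative to the directory that holds
``metadata.yaml``, a quick scan for missing files can catch typos before the
dataset is loaded. The sketch below is one rough way to do that with the
standard library only. It matches ``path:`` lines with a regular expression
instead of parsing the YAML, so it assumes one unquoted or simply quoted path
per line. ``find_missing_paths`` is a hypothetical helper and not part of
GraphBolt.

.. code:: python

    import re
    import tempfile
    from pathlib import Path

    # Matches lines such as "path: edges/edges.csv" or "- path: a/b.npy".
    PATH_LINE = re.compile(r"^\s*(?:-\s*)?path:\s*(\S+)\s*$")


    def find_missing_paths(dataset_dir):
        """List ``path`` entries in metadata.yaml that are not files on disk."""
        dataset_dir = Path(dataset_dir)
        text = (dataset_dir / "metadata.yaml").read_text()
        missing = []
        for line in text.splitlines():
            match = PATH_LINE.match(line)
            if match:
                relative = match.group(1).strip("'\"")
                if not (dataset_dir / relative).is_file():
                    missing.append(relative)
        return missing


    # Tiny throwaway dataset: the edge file exists, the feature file does not.
    with tempfile.TemporaryDirectory() as tmp:
        root = Path(tmp)
        (root / "edges").mkdir()
        (root / "edges" / "edges.csv").write_text("0,1\n")
        (root / "metadata.yaml").write_text(
            "graph:\n"
            "  nodes:\n"
            "    - num: 2\n"
            "  edges:\n"
            "    - format: csv\n"
            "      path: edges/edges.csv\n"
            "feature_data:\n"
            "  - domain: node\n"
            "    name: feat\n"
            "    format: numpy\n"
            "    path: data/node_feat.npy\n"
        )
        missing = find_missing_paths(root)
    print(missing)
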

.. _stochastic_training-ondisk-dataset:

Creating OnDiskDataset
======================

This tutorial shows how to create an `OnDiskDataset` from raw data and use it
for stochastic training.

For more details about `OnDiskDataset`, please refer to the
:class:`~dgl.graphbolt.OnDiskDataset` API documentation.

.. toctree::
    :maxdepth: 1
    :glob:
.. code:: python
from dgl.graphbolt import OnDiskDataset
dataset = OnDiskDataset('/path/to/dataset')
The returned ``dataset`` object just loads the yaml file and does not load any
data. To load the graph structure, feature data and tasks, you need to call
the ``load`` method.
.. code:: python
dataset.load()
The reason why we separate the ``OnDiskDataset`` object creation and data loading
is that you may want to change some fields in the ``metadata.yaml`` file before
loading the data. For example, you may want to change the path of the feature
data files to point to a different directory. In this case, you can just
modify the path via ``dataset.yaml_data`` directly. Then call the ``load`` method
again to load the data.
After loading the data, you can access the graph structure, feature data and
tasks through the ``graph``, ``feature`` and ``tasks`` attributes respectively.
.. code:: python
graph = dataset.graph
feature = dataset.feature
tasks = dataset.tasks
The returned ``graph`` is a ``FusedCSCSamplingGraph`` object, which will be used
for sampling. The returned ``feature`` is a ``TorchBasedFeatureStore`` object,
which will be used for feature lookup. The returned ``tasks`` is a list of
``Task`` objects, which will be used for training and evaluation.
The following examples show data folder structure and ``metadata.yaml`` file for
homogeneous graphs and heterogeneous graphs respectively. If you want to know
the full YAML specification, please refer to the `Full YAML specification`_ section.
Homogeneous graph
-----------------
Data folder structure:
^^^^^^^^^^^^^^^^^^^^^
.. code::
data/
node_feat.npy
edge_feat.npy
edges/
edges.csv
set_nc/
train_seed_nodes.npy
train_labels.npy
val_seed_nodes.npy
val_labels.npy
test_seed_nodes.npy
test_labels.npy
set_lp/
train_node_pairs.npy
val_node_pairs.npy
val_negative_dsts.npy
test_node_pairs.npy
test_negative_dsts.npy
metadata.yaml
``metadata.yaml`` file:
^^^^^^^^^^^^^^^^^^^^^
.. code:: yaml
dataset_name: homogeneous_graph_nc_lp
graph:
nodes:
- num: 10
edges:
- format: csv
path: edges/edges.csv
feature_data:
- domain: node
name: feat
format: numpy
in_memory: true
path: data/node_feat.npy
- domain: edge
name: feat
format: numpy
in_memory: true
path: data/edge_feat.npy
tasks:
- name: node_classification
num_classes: 2
train_set:
- data:
- name: seed_nodes
format: numpy
in_memory: true
path: set_nc/train_seed_nodes.npy
- name: labels
format: numpy
in_memory: true
path: set_nc/train_labels.npy
validation_set:
- data:
- name: seed_nodes
format: numpy
in_memory: true
path: set_nc/val_seed_nodes.npy
- name: labels
format: numpy
in_memory: true
path: set_nc/val_labels.npy
test_set:
- data:
- name: seed_nodes
format: numpy
in_memory: true
path: set_nc/test_seed_nodes.npy
- name: labels
format: numpy
in_memory: true
path: set_nc/test_labels.npy
- name: link_prediction
num_classes: 2
train_set:
- data:
- name: node_pairs
format: numpy
in_memory: true
path: set_lp/train_node_pairs.npy
validation_set:
- data:
- name: node_pairs
format: numpy
in_memory: true
path: set_lp/val_node_pairs.npy
- name: negative_dsts
format: numpy
in_memory: true
path: set_lp/val_negative_dsts.npy
test_set:
- data:
- name: node_pairs
format: numpy
in_memory: true
path: set_lp/test_node_pairs.npy
- name: negative_dsts
format: numpy
in_memory: true
path: set_lp/test_negative_dsts.npy
For the graph structure, number of nodes is specified by the ``num`` field and
edges are stored in a csv file in format of ``<src, dst>`` like below.
.. code:: csv
edges.csv
0,1
1,2
2,3
3,4
4,5
5,6
6,7
7,8
8,9
For the feature data, we have feature data named as ``feat`` for nodes and
edges. The feature data are stored in numpy files in shape of ``[num_nodes, 10]``
and ``[num_edges, 10]`` respectively like below.
.. code:: python
node_feat.npy
array([[0., 0., 0., 0., 0., 0., 0., 0., 0., 0.],
[1., 1., 1., 1., 1., 1., 1., 1., 1., 1.],
[2., 2., 2., 2., 2., 2., 2., 2., 2., 2.],
[3., 3., 3., 3., 3., 3., 3., 3., 3., 3.],
[4., 4., 4., 4., 4., 4., 4., 4., 4., 4.],
[5., 5., 5., 5., 5., 5., 5., 5., 5., 5.],
[6., 6., 6., 6., 6., 6., 6., 6., 6., 6.],
[7., 7., 7., 7., 7., 7., 7., 7., 7., 7.],
[8., 8., 8., 8., 8., 8., 8., 8., 8., 8.],
[9., 9., 9., 9., 9., 9., 9., 9., 9., 9.]])
edge_feat.npy
array([[0., 0., 0., 0., 0., 0., 0., 0., 0., 0.],
[1., 1., 1., 1., 1., 1., 1., 1., 1., 1.],
[2., 2., 2., 2., 2., 2., 2., 2., 2., 2.],
[3., 3., 3., 3., 3., 3., 3., 3., 3., 3.],
[4., 4., 4., 4., 4., 4., 4., 4., 4., 4.],
[5., 5., 5., 5., 5., 5., 5., 5., 5., 5.],
[6., 6., 6., 6., 6., 6., 6., 6., 6., 6.],
[7., 7., 7., 7., 7., 7., 7., 7., 7., 7.],
[8., 8., 8., 8., 8., 8., 8., 8., 8., 8.],
[9., 9., 9., 9., 9., 9., 9., 9., 9., 9.]])
For the ``tasks`` field, we have two tasks: ``node_classification`` and
``link_prediction``. For each task, we have three sets: ``train_set``,
``validation_set`` and ``test_set``.
For ``node_classification`` task, we have two fields: ``seed_nodes`` and
``labels``. The ``seed_nodes`` field is used to specify the node IDs for
training and evaluation. The ``labels`` field is used to specify the
labels. Both of them are stored in numpy files with shape of ``[num_nodes]``
like below.
.. code:: python
train_seed_nodes.npy
array([0, 1, 2, 3, 4, 5])
train_labels.npy
array([0, 1, 0, 1, 0, 1])
val_seed_nodes.npy
array([6, 7])
val_labels.npy
array([0, 1])
test_seed_nodes.npy
array([8, 9])
test_labels.npy
array([0, 1])
For ``link_prediction`` task, we have two fields: ``node_pairs``,
``negative_dsts``. The ``node_pairs`` field is used to specify the node pairs.
The ``negative_dsts`` field is used to specify the negative destination nodes.
They are stored in numpy file with shape of ``[num_edges, 2]`` and
``[num_edges, num_neg_dsts]`` respectively like below.
.. code:: python
train_node_pairs.npy
array([[0, 1],
[1, 2],
[2, 3],
[3, 4],
[4, 5],
[5, 6]])
val_node_pairs.npy
array([[6, 7],
[7, 8]])
val_negative_dsts.npy
array([[8, 9],
[8, 9]])
test_node_pairs.npy
array([[8, 9],
[9, 0]])
test_negative_dsts.npy
array([[0, 1],
[0, 1]])
.. note::
The values of ``name`` fields in the ``task`` such as ``seed_nodes``,
``labels``, ``node_pairs`` and ``negative_dsts`` are mandatory. They are
used to specify the data fields of ``MiniBatch`` for sampling. The values
of ``name`` fields in the ``feature_data`` such as ``feat`` are user-defined.
Heterogeneous graph
-----------------
Data folder structure:
^^^^^^^^^^^^^^^^^^^^^
.. code::
data/
user_feat.npy
item_feat.npy
user_follow_user_feat.npy
user_click_item_feat.npy
edges/
user_follow_user.csv
user_click_item.csv
set_nc/
user_train_seed_nodes.npy
user_train_labels.npy
user_val_seed_nodes.npy
user_val_labels.npy
user_test_seed_nodes.npy
user_test_labels.npy
set_lp/
follow_train_node_pairs.npy
follow_val_node_pairs.npy
follow_val_negative_dsts.npy
follow_test_node_pairs.npy
follow_test_negative_dsts.npy
metadata.yaml
``metadata.yaml`` file:
^^^^^^^^^^^^^^^^^^^^^
.. code:: yaml
dataset_name: heterogeneous_graph_nc_lp
graph:
nodes:
- type: user
num: 10
- type: item
num: 10
edges:
- type: "user:follow:user"
format: csv
path: edges/user_follow_user.csv
- type: "user:click:item"
format: csv
path: edges/user_click_item.csv
feature_data:
- domain: node
type: user
name: feat
format: numpy
in_memory: true
path: data/user_feat.npy
- domain: node
type: item
name: feat
format: numpy
in_memory: true
path: data/item_feat.npy
- domain: edge
type: "user:follow:user"
name: feat
format: numpy
in_memory: true
path: data/user_follow_user_feat.npy
- domain: edge
type: "user:click:item"
name: feat
format: numpy
in_memory: true
path: data/user_click_item_feat.npy
tasks:
- name: node_classification
num_classes: 2
train_set:
- type: user
data:
- name: seed_nodes
format: numpy
in_memory: true
path: set/user_train_seed_nodes.npy
- name: labels
format: numpy
in_memory: true
path: set/user_train_labels.npy
validation_set:
- type: user
data:
- name: seed_nodes
format: numpy
in_memory: true
path: set/user_val_seed_nodes.npy
- name: labels
format: numpy
in_memory: true
path: set/user_val_labels.npy
test_set:
- type: user
data:
- name: seed_nodes
format: numpy
in_memory: true
path: set/user_test_seed_nodes.npy
- name: labels
format: numpy
in_memory: true
path: set/user_test_labels.npy
- name: link_prediction
num_classes: 2
train_set:
- type: "user:follow:user"
data:
- name: node_pairs
format: numpy
in_memory: true
path: set/follow_train_node_pairs.npy
validation_set:
- type: "user:follow:user"
data:
- name: node_pairs
format: numpy
in_memory: true
path: set/follow_val_node_pairs.npy
- name: negative_dsts
format: numpy
in_memory: true
path: set/follow_val_negative_dsts.npy
test_set:
- type: "user:follow:user"
data:
- name: node_pairs
format: numpy
in_memory: true
path: set/follow_test_node_pairs.npy
- name: negative_dsts
format: numpy
in_memory: true
path: set/follow_test_negative_dsts.npy
For the graph structure, we have two types of nodes: ``user`` and ``item``
in above example. Number of each node type is specified by the ``num`` field.
We have two types of edges: ``user:follow:user`` and ``user:click:item``.
The edges are stored in two columns of csv files like below.
.. code:: csv
user_follow_user.csv
0,1
1,2
2,3
3,4
4,5
5,6
6,7
7,8
8,9
user_click_item.csv
0,0
1,1
2,2
3,3
4,4
5,5
6,6
7,7
8,8
9,9
For the feature data, we have feature data named as ``feat`` for nodes and
edges. The feature data are stored in numpy files in shape of ``[num_nodes, 10]``
and ``[num_edges, 10]`` respectively like below.

.. code:: python

    user_feat.npy
    array([[0., 0., 0., 0., 0., 0., 0., 0., 0., 0.],
           [1., 1., 1., 1., 1., 1., 1., 1., 1., 1.],
           [2., 2., 2., 2., 2., 2., 2., 2., 2., 2.],
           [3., 3., 3., 3., 3., 3., 3., 3., 3., 3.],
           [4., 4., 4., 4., 4., 4., 4., 4., 4., 4.],
           [5., 5., 5., 5., 5., 5., 5., 5., 5., 5.],
           [6., 6., 6., 6., 6., 6., 6., 6., 6., 6.],
           [7., 7., 7., 7., 7., 7., 7., 7., 7., 7.],
           [8., 8., 8., 8., 8., 8., 8., 8., 8., 8.],
           [9., 9., 9., 9., 9., 9., 9., 9., 9., 9.]])

    item_feat.npy
    array([[0., 0., 0., 0., 0., 0., 0., 0., 0., 0.],
           [1., 1., 1., 1., 1., 1., 1., 1., 1., 1.],
           [2., 2., 2., 2., 2., 2., 2., 2., 2., 2.],
           [3., 3., 3., 3., 3., 3., 3., 3., 3., 3.],
           [4., 4., 4., 4., 4., 4., 4., 4., 4., 4.],
           [5., 5., 5., 5., 5., 5., 5., 5., 5., 5.],
           [6., 6., 6., 6., 6., 6., 6., 6., 6., 6.],
           [7., 7., 7., 7., 7., 7., 7., 7., 7., 7.],
           [8., 8., 8., 8., 8., 8., 8., 8., 8., 8.],
           [9., 9., 9., 9., 9., 9., 9., 9., 9., 9.]])

    user_follow_user_feat.npy
    array([[0., 0., 0., 0., 0., 0., 0., 0., 0., 0.],
           [1., 1., 1., 1., 1., 1., 1., 1., 1., 1.],
           [2., 2., 2., 2., 2., 2., 2., 2., 2., 2.],
           [3., 3., 3., 3., 3., 3., 3., 3., 3., 3.],
           [4., 4., 4., 4., 4., 4., 4., 4., 4., 4.],
           [5., 5., 5., 5., 5., 5., 5., 5., 5., 5.],
           [6., 6., 6., 6., 6., 6., 6., 6., 6., 6.],
           [7., 7., 7., 7., 7., 7., 7., 7., 7., 7.],
           [8., 8., 8., 8., 8., 8., 8., 8., 8., 8.],
           [9., 9., 9., 9., 9., 9., 9., 9., 9., 9.]])

    user_click_item_feat.npy
    array([[0., 0., 0., 0., 0., 0., 0., 0., 0., 0.],
           [1., 1., 1., 1., 1., 1., 1., 1., 1., 1.],
           [2., 2., 2., 2., 2., 2., 2., 2., 2., 2.],
           [3., 3., 3., 3., 3., 3., 3., 3., 3., 3.],
           [4., 4., 4., 4., 4., 4., 4., 4., 4., 4.],
           [5., 5., 5., 5., 5., 5., 5., 5., 5., 5.],
           [6., 6., 6., 6., 6., 6., 6., 6., 6., 6.],
           [7., 7., 7., 7., 7., 7., 7., 7., 7., 7.],
           [8., 8., 8., 8., 8., 8., 8., 8., 8., 8.],
           [9., 9., 9., 9., 9., 9., 9., 9., 9., 9.]])

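The feature files above could be produced along the following lines. This is
only a sketch: it assumes NumPy is available, writes into a temporary
``base_dir`` chosen for illustration, and reuses one array for all four files
because the listings above happen to be identical.

```python
import os
import tempfile

import numpy as np

# Hypothetical dataset directory; a real dataset would use a persistent path.
base_dir = tempfile.mkdtemp()

num_rows, feat_dim = 10, 10
# Row i is filled with the value i, matching the arrays shown above.
feat = np.repeat(np.arange(num_rows, dtype=np.float64)[:, None], feat_dim, axis=1)

for name in ("user_feat", "item_feat", "user_follow_user_feat", "user_click_item_feat"):
    np.save(os.path.join(base_dir, name + ".npy"), feat)

# Load one file back to confirm its shape and contents.
loaded = np.load(os.path.join(base_dir, "user_feat.npy"))
```
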
For the ``tasks`` field, we have two tasks: ``node_classification`` and
``link_prediction``. Each task has three sets: ``train_set``,
``validation_set`` and ``test_set``.

For the ``node_classification`` task, we have two fields: ``seed_nodes`` and
``labels``. The ``seed_nodes`` field specifies the node IDs used for training
and evaluation, and the ``labels`` field specifies their labels. Both are
stored in NumPy files of shape ``[num_nodes]``, where ``num_nodes`` is the
number of seed nodes in the set, as shown below.

.. code:: python

    user_train_seed_nodes.npy
    array([0, 1, 2, 3, 4, 5])

    user_train_labels.npy
    array([0, 1, 0, 1, 0, 1])

    user_val_seed_nodes.npy
    array([6, 7])

    user_val_labels.npy
    array([0, 1])

    user_test_seed_nodes.npy
    array([8, 9])

    user_test_labels.npy
    array([0, 1])

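A short sketch of how these files might be written, assuming NumPy and a
temporary ``base_dir`` used only for illustration. The alternating labels are
derived as ``ids % 2`` purely because that reproduces the example values.

```python
import os
import tempfile

import numpy as np

# Hypothetical dataset directory; a real dataset would use a persistent path.
base_dir = tempfile.mkdtemp()

# Node ID ranges for each split, as in the example above.
seed_nodes = {
    "train": np.arange(0, 6),
    "val": np.arange(6, 8),
    "test": np.arange(8, 10),
}

for split, ids in seed_nodes.items():
    labels = ids % 2  # alternating 0/1 labels, as in the example
    # Each label must correspond to exactly one seed node.
    assert labels.shape == ids.shape
    np.save(os.path.join(base_dir, f"user_{split}_seed_nodes.npy"), ids)
    np.save(os.path.join(base_dir, f"user_{split}_labels.npy"), labels)

train_labels = np.load(os.path.join(base_dir, "user_train_labels.npy"))
```
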
For the ``link_prediction`` task, we have two fields: ``node_pairs`` and
``negative_dsts``. The ``node_pairs`` field specifies the node pairs, and the
``negative_dsts`` field specifies the negative destination nodes for each pair.
They are stored in NumPy files of shape ``[num_edges, 2]`` and
``[num_edges, num_neg_dsts]`` respectively, as shown below.

.. code:: python

    follow_train_node_pairs.npy
    array([[0, 1],
           [1, 2],
           [2, 3],
           [3, 4],
           [4, 5],
           [5, 6]])

    follow_val_node_pairs.npy
    array([[6, 7],
           [7, 8]])

    follow_val_negative_dsts.npy
    array([[8, 9],
           [8, 9]])

    follow_test_node_pairs.npy
    array([[8, 9],
           [9, 0]])

    follow_test_negative_dsts.npy
    array([[0, 1],
           [0, 1]])

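The shape relationship between the two arrays can be checked before saving
them. The helper below is a hypothetical sketch (it is not a GraphBolt
function) and assumes NumPy is available.

```python
import numpy as np


def check_link_prediction_set(node_pairs, negative_dsts=None):
    """Check the array shapes described above (hypothetical helper)."""
    # node_pairs must have shape [num_edges, 2].
    assert node_pairs.ndim == 2 and node_pairs.shape[1] == 2
    if negative_dsts is not None:
        # One row of negative destinations per positive edge.
        assert negative_dsts.ndim == 2
        assert negative_dsts.shape[0] == node_pairs.shape[0]
    return node_pairs.shape[0]


val_node_pairs = np.array([[6, 7], [7, 8]])
val_negative_dsts = np.array([[8, 9], [8, 9]])
num_val_edges = check_link_prediction_set(val_node_pairs, val_negative_dsts)
```

Here ``num_neg_dsts`` is 2, so ``val_negative_dsts`` has one row of two
candidate destinations for each of the two validation edges.
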
Full YAML specification
-----------------------
The full YAML specification of ``metadata.yaml`` file is shown below.

.. code:: yaml

    dataset_name: <string>
    graph:
      nodes:
        - type: <string>
          num: <int>
        - type: <string>
          num: <int>
      edges:
        - type: <string>
          format: <string>
          path: <string>
        - type: <string>
          format: <string>
          path: <string>
    feature_data:
      - domain: node
        type: <string>
        name: <string>
        format: <string>
        in_memory: <bool>
        path: <string>
      - domain: node
        type: <string>
        name: <string>
        format: <string>
        in_memory: <bool>
        path: <string>
      - domain: edge
        type: <string>
        name: <string>
        format: <string>
        in_memory: <bool>
        path: <string>
      - domain: edge
        type: <string>
        name: <string>
        format: <string>
        in_memory: <bool>
        path: <string>
    tasks:
      - name: <string>
        num_classes: <int>
        train_set:
          - type: <string>
            data:
              - name: <string>
                format: <string>
                in_memory: <bool>
                path: <string>
              - name: <string>
                format: <string>
                in_memory: <bool>
                path: <string>
        validation_set:
          - type: <string>
            data:
              - name: <string>
                format: <string>
                in_memory: <bool>
                path: <string>
              - name: <string>
                format: <string>
                in_memory: <bool>
                path: <string>
        test_set:
          - type: <string>
            data:
              - name: <string>
                format: <string>
                in_memory: <bool>
                path: <string>
              - name: <string>
                format: <string>
                in_memory: <bool>
                path: <string>

``dataset_name``
^^^^^^^^^^^^^^^^
The ``dataset_name`` field is used to specify the name of the dataset. It is
user-defined.
``graph``
^^^^^^^^^
The ``graph`` field is used to specify the graph structure. It has two fields:
``nodes`` and ``edges``.

- ``nodes``: ``list``

  The ``nodes`` field is used to specify the number of nodes for each node type.
  It is a list of ``node`` objects. Each ``node`` object has two fields: ``type``
  and ``num``.

  - ``type``: ``string``, optional

    The ``type`` field is used to specify the node type. It is ``null`` for
    homogeneous graphs. For heterogeneous graphs, it is the node type.

  - ``num``: ``int``

    The ``num`` field is used to specify the number of nodes for the node type.
    It is mandatory for both homogeneous graphs and heterogeneous graphs.

- ``edges``: ``list``

  The ``edges`` field is used to specify the edges. It is a list of ``edge``
  objects. Each ``edge`` object has three fields: ``type``, ``format`` and
  ``path``.

  - ``type``: ``string``, optional

    The ``type`` field is used to specify the edge type. It is ``null`` for
    homogeneous graphs. For heterogeneous graphs, it is the edge type.

  - ``format``: ``string``

    The ``format`` field is used to specify the format of the edge data. It can
    only be ``csv`` for now.

  - ``path``: ``string``

    The ``path`` field is used to specify the path of the edge data. It is
    relative to the directory of ``metadata.yaml`` file.

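The rules for the ``graph`` section can be expressed as a small check over the
parsed metadata. ``validate_graph`` is a hypothetical helper written for
illustration only, and the node counts and file names are taken from the
running example rather than from any GraphBolt API.

```python
def validate_graph(graph):
    """Check a parsed ``graph`` section against the rules listed above."""
    for node in graph["nodes"]:
        # `num` is mandatory; `type` may be omitted for homogeneous graphs.
        if not isinstance(node.get("num"), int):
            raise ValueError("every node entry needs an integer `num`")
    for edge in graph["edges"]:
        if edge.get("format") != "csv":
            raise ValueError("edge `format` can only be `csv` for now")
        if not isinstance(edge.get("path"), str):
            raise ValueError("every edge entry needs a `path`")
    return True


graph = {
    "nodes": [{"type": "user", "num": 10}, {"type": "item", "num": 10}],
    "edges": [
        {"type": "user:follow:user", "format": "csv", "path": "user_follow_user.csv"},
        {"type": "user:click:item", "format": "csv", "path": "user_click_item.csv"},
    ],
}
graph_ok = validate_graph(graph)
```
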
``feature_data``
^^^^^^^^^^^^^^^^
The ``feature_data`` field is used to specify the feature data. It is a list of
``feature`` objects. Each ``feature`` object has six canonical fields:
``domain``, ``type``, ``name``, ``format``, ``in_memory`` and ``path``. Any
other fields will be passed to the ``Feature.metadata`` object.

- ``domain``: ``string``

  The ``domain`` field is used to specify the domain of the feature data. It can
  be either ``node`` or ``edge``.

- ``type``: ``string``, optional

  The ``type`` field is used to specify the type of the feature data. It is
  ``null`` for homogeneous graphs. For heterogeneous graphs, it is the node or
  edge type.

- ``name``: ``string``

  The ``name`` field is used to specify the name of the feature data. It is
  user-defined.

- ``format``: ``string``

  The ``format`` field is used to specify the format of the feature data. It can
  be either ``numpy`` or ``torch``.

- ``in_memory``: ``bool``, optional

  The ``in_memory`` field is used to specify whether the feature data is loaded
  into memory. It can be either ``true`` or ``false``. Default is ``true``.

- ``path``: ``string``

  The ``path`` field is used to specify the path of the feature data. It is
  relative to the directory of ``metadata.yaml`` file.

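To illustrate the defaults and the relative-path rule, the sketch below
normalizes one ``feature`` entry. ``resolve_feature_entry`` is a hypothetical
helper, not how GraphBolt itself processes the file, and ``/path/to/dataset``
stands in for the directory that contains ``metadata.yaml``.

```python
import os


def resolve_feature_entry(metadata_dir, entry):
    """Return a copy of `entry` with defaults applied and `path` resolved."""
    if entry["domain"] not in ("node", "edge"):
        raise ValueError("`domain` must be `node` or `edge`")
    if entry["format"] not in ("numpy", "torch"):
        raise ValueError("`format` must be `numpy` or `torch`")
    resolved = dict(entry)
    resolved.setdefault("type", None)       # null for homogeneous graphs
    resolved.setdefault("in_memory", True)  # documented default
    # Paths are relative to the directory of metadata.yaml.
    resolved["path"] = os.path.join(metadata_dir, entry["path"])
    return resolved


entry = {
    "domain": "node",
    "type": "user",
    "name": "feat",
    "format": "numpy",
    "path": "user_feat.npy",
}
resolved = resolve_feature_entry("/path/to/dataset", entry)
```
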
``tasks``
^^^^^^^^^
The ``tasks`` field is used to specify the tasks. It is a list of ``task``
objects. Each ``task`` object has at least three fields: ``train_set``,
``validation_set`` and ``test_set``. You are free to add other fields such as
``num_classes``; all such fields are passed to the ``Task.metadata`` object.

- ``name``: ``string``, optional

  The ``name`` field is used to specify the name of the task. It is user-defined.

- ``num_classes``: ``int``, optional

  The ``num_classes`` field is used to specify the number of classes of the task.

- ``train_set``: ``list``

  The ``train_set`` field is used to specify the training set. It is a list of
  ``set`` objects. Each ``set`` object has two fields: ``type`` and ``data``.

  - ``type``: ``string``, optional

    The ``type`` field is used to specify the node/edge type of the set. It is
    ``null`` for homogeneous graphs. For heterogeneous graphs, it is the node
    or edge type.

  - ``data``: ``list``

    The ``data`` field specifies the data files that make up ``train_set``. It
    is a list of ``data`` objects. Each ``data`` object has four fields:
    ``name``, ``format``, ``in_memory`` and ``path``.

    - ``name``: ``string``

      The ``name`` field is used to specify the name of the data. It is
      mandatory and determines which ``MiniBatch`` data field the data
      populates during sampling. The predefined names are ``seed_nodes``,
      ``labels``, ``node_pairs``, ``negative_srcs`` and ``negative_dsts``. If
      any other name is used, it will be added into the ``MiniBatch`` data
      fields.

    - ``format``: ``string``

      The ``format`` field is used to specify the format of the data. It can be
      either ``numpy`` or ``torch``.

    - ``in_memory``: ``bool``, optional

      The ``in_memory`` field is used to specify whether the data is loaded into
      memory. It can be either ``true`` or ``false``. Default is ``true``.

    - ``path``: ``string``

      The ``path`` field is used to specify the path of the data. It is relative
      to the directory of ``metadata.yaml`` file.

- ``validation_set``: ``list``
- ``test_set``: ``list``

  The ``validation_set`` and ``test_set`` fields are used to specify the
  validation set and test set respectively. They have the same structure as
  the ``train_set`` field.

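The distinction between the predefined ``name`` values and user-defined ones
can be sketched as follows. ``split_data_names`` and the ``timestamps`` entry
are hypothetical and exist only to illustrate the rule; the set of predefined
names is copied from the list above.

```python
PREDEFINED_NAMES = {
    "seed_nodes",
    "labels",
    "node_pairs",
    "negative_srcs",
    "negative_dsts",
}


def split_data_names(task_set):
    """Separate predefined data names from user-defined ones in one set."""
    names = [d["name"] for item in task_set for d in item["data"]]
    predefined = [n for n in names if n in PREDEFINED_NAMES]
    custom = [n for n in names if n not in PREDEFINED_NAMES]
    return predefined, custom


train_set = [
    {
        "type": "user",
        "data": [
            {"name": "seed_nodes", "format": "numpy", "path": "user_train_seed_nodes.npy"},
            {"name": "labels", "format": "numpy", "path": "user_train_labels.npy"},
            # Hypothetical extra field, not part of the example dataset.
            {"name": "timestamps", "format": "numpy", "path": "user_train_timestamps.npy"},
        ],
    }
]
predefined, custom = split_data_names(train_set)
```
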
ondisk_dataset_homograph.nblink
ondisk_dataset_heterograph.nblink
ondisk-dataset-specification.rst
{
"path": "../../../notebooks/stochastic_training/ondisk_dataset_heterograph.ipynb"
}
{
"path": "../../../notebooks/stochastic_training/ondisk_dataset_homograph.ipynb"
}
{
"nbformat": 4,
"nbformat_minor": 0,
"metadata": {
"colab": {
"private_outputs": true,
"provenance": [],
"authorship_tag": "ABX9TyM1zJGR6lVdC9JfDbddFLpa"
},
"kernelspec": {
"name": "python3",
"display_name": "Python 3"
},
"language_info": {
"name": "python"
}
},
"cells": [
{
"cell_type": "markdown",
"source": [
"# OnDiskDataset for Heterogeneous Graph\n",
"\n",
"[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/dmlc/dgl/blob/master/notebooks/stochastic_training/ondisk_dataset_heterograph.ipynb) [![GitHub](https://img.shields.io/badge/-View%20on%20GitHub-181717?logo=github&logoColor=ffffff)](https://github.com/dmlc/dgl/blob/master/notebooks/stochastic_training/ondisk_dataset_heterograph.ipynb)\n",
"\n",
"This tutorial shows how to create `OnDiskDataset` for heterogeneous graph that could be used in **GraphBolt** framework.\n",
"\n",
"By the end of this tutorial, you will be able to\n",
"- organize graph structure data.\n",
"- organize feature data.\n",
"- organize training/validation/test set for specific tasks."
],
"metadata": {
"id": "FnFhPMaAfLtJ"
}
},
{
"cell_type": "markdown",
"source": [
"## Install DGL package"
],
"metadata": {
"id": "Wlb19DtWgtzq"
}
},
{
"cell_type": "code",
"source": [
"# Install required packages.\n",
"import os\n",
"import torch\n",
"import numpy as np\n",
"os.environ['TORCH'] = torch.__version__\n",
"os.environ['DGLBACKEND'] = \"pytorch\"\n",
"\n",
"# Install the CPU version.\n",
"device = torch.device(\"cpu\")\n",
"!pip install --pre dgl -f https://data.dgl.ai/wheels-test/repo.html\n",
"\n",
"try:\n",
" import dgl\n",
" import dgl.graphbolt as gb\n",
" installed = True\n",
"except ImportError as error:\n",
" installed = False\n",
" print(error)\n",
"print(\"DGL installed!\" if installed else \"DGL not found!\")"
],
"metadata": {
"id": "UojlT9ZGgyr9"
},
"execution_count": null,
"outputs": []
}
]
}
{
"nbformat": 4,
"nbformat_minor": 0,
"metadata": {
"colab": {
"private_outputs": true,
"provenance": [],
"authorship_tag": "ABX9TyMnOgpk68ZvpOQVFBgDxDof"
},
"kernelspec": {
"name": "python3",
"display_name": "Python 3"
},
"language_info": {
"name": "python"
}
},
"cells": [
{
"cell_type": "markdown",
"source": [
"# OnDiskDataset for Homogeneous Graph\n",
"\n",
"[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/dmlc/dgl/blob/master/notebooks/stochastic_training/ondisk_dataset_homograph.ipynb) [![GitHub](https://img.shields.io/badge/-View%20on%20GitHub-181717?logo=github&logoColor=ffffff)](https://github.com/dmlc/dgl/blob/master/notebooks/stochastic_training/ondisk_dataset_homograph.ipynb)\n",
"\n",
"This tutorial shows how to create `OnDiskDataset` for homogeneous graph that could be used in **GraphBolt** framework.\n",
"\n",
"By the end of this tutorial, you will be able to\n",
"- organize graph structure data.\n",
"- organize feature data.\n",
"- organize training/validation/test set for specific tasks."
],
"metadata": {
"id": "FnFhPMaAfLtJ"
}
},
{
"cell_type": "markdown",
"source": [
"## Install DGL package"
],
"metadata": {
"id": "Wlb19DtWgtzq"
}
},
{
"cell_type": "code",
"source": [
"# Install required packages.\n",
"import os\n",
"import torch\n",
"import numpy as np\n",
"os.environ['TORCH'] = torch.__version__\n",
"os.environ['DGLBACKEND'] = \"pytorch\"\n",
"\n",
"# Install the CPU version.\n",
"device = torch.device(\"cpu\")\n",
"!pip install --pre dgl -f https://data.dgl.ai/wheels-test/repo.html\n",
"\n",
"try:\n",
" import dgl\n",
" import dgl.graphbolt as gb\n",
" installed = True\n",
"except ImportError as error:\n",
" installed = False\n",
" print(error)\n",
"print(\"DGL installed!\" if installed else \"DGL not found!\")"
],
"metadata": {
"id": "UojlT9ZGgyr9"
},
"execution_count": null,
"outputs": []
}
]
}