Commit 3e59b1d4 (unverified), authored by Rhett Ying, committed by GitHub

[doc] update OnDiskDataset (#6811)

parent 7094ff4f
.. _stochastic_training-ondisk-dataset: .. _stochastic_training-ondisk-dataset:
Creating OnDiskDataset Composing OnDiskDataset from raw data
====================== =====================================
This tutorial shows how to create an `OnDiskDataset` from raw data and use it This tutorial shows how to compose :class:`~dgl.graphbolt.OnDiskDataset` from
for stochastic training. raw data. A full specification of ``metadata.yaml`` is also provided.
**GraphBolt** provides the ``OnDiskDataset`` class to help users organize plain **GraphBolt** provides the ``OnDiskDataset`` class to help users organize plain
data of graph structure, feature data and tasks. ``OnDiskDataset`` is also data of graph structure, feature data and tasks. ``OnDiskDataset`` is also
designed to efficiently handle large graphs and features that do not fit into designed to efficiently handle large graphs and features that do not fit into
memory by storing them on disk. memory by storing them on disk.
For more details about `OnDiskDataset`, please refer to the
:class:`~dgl.graphbolt.OnDiskDataset` API documentation.
.. toctree:: .. toctree::
:maxdepth: 1 :maxdepth: 1
:glob: :glob:
......
...@@ -25,6 +25,7 @@ ...@@ -25,6 +25,7 @@
"This tutorial shows how to create `OnDiskDataset` for heterogeneous graph that could be used in **GraphBolt** framework. The major difference from creating dataset for homogeneous graph is that we need to specify node/edge types for edges, feature data, training/validation/test sets.\n", "This tutorial shows how to create `OnDiskDataset` for heterogeneous graph that could be used in **GraphBolt** framework. The major difference from creating dataset for homogeneous graph is that we need to specify node/edge types for edges, feature data, training/validation/test sets.\n",
"\n", "\n",
"By the end of this tutorial, you will be able to\n", "By the end of this tutorial, you will be able to\n",
"\n",
"- organize graph structure data.\n", "- organize graph structure data.\n",
"- organize feature data.\n", "- organize feature data.\n",
"- organize training/validation/test set for specific tasks.\n", "- organize training/validation/test set for specific tasks.\n",
...@@ -104,7 +105,7 @@ ...@@ -104,7 +105,7 @@
"### Generate graph structure data\n", "### Generate graph structure data\n",
"For heterogeneous graph, we need to save edges of different types (namely node pairs) into separate **CSV** files.\n", "For heterogeneous graph, we need to save edges of different types (namely node pairs) into separate **CSV** files.\n",
"\n", "\n",
"Note:\n", "**Note**:\n",
"when saving to file, do not save index and header.\n" "when saving to file, do not save index and header.\n"
], ],
"metadata": { "metadata": {
......
...@@ -25,6 +25,7 @@ ...@@ -25,6 +25,7 @@
"This tutorial shows how to create `OnDiskDataset` for homogeneous graph that could be used in **GraphBolt** framework.\n", "This tutorial shows how to create `OnDiskDataset` for homogeneous graph that could be used in **GraphBolt** framework.\n",
"\n", "\n",
"By the end of this tutorial, you will be able to\n", "By the end of this tutorial, you will be able to\n",
"\n",
"- organize graph structure data.\n", "- organize graph structure data.\n",
"- organize feature data.\n", "- organize feature data.\n",
"- organize training/validation/test set for specific tasks.\n", "- organize training/validation/test set for specific tasks.\n",
......
...@@ -281,22 +281,33 @@ class OnDiskDataset(Dataset): ...@@ -281,22 +281,33 @@ class OnDiskDataset(Dataset):
Due to limited resources, the data which are too large to fit into RAM will Due to limited resources, the data which are too large to fit into RAM will
remain on disk while others reside in RAM once ``OnDiskDataset`` is remain on disk while others reside in RAM once ``OnDiskDataset`` is
initialized. This behavior could be controlled by user via ``in_memory`` initialized. This behavior could be controlled by user via ``in_memory``
field in YAML file. field in YAML file. All paths in YAML file are relative paths to the
dataset directory.
A full example of YAML file is as follows: A full example of YAML file is as follows:
.. code-block:: yaml .. code-block:: yaml
dataset_name: graphbolt_test dataset_name: graphbolt_test
graph_topology: graph:
type: FusedCSCSamplingGraph nodes:
path: graph_topology/fused_csc_sampling_graph.tar - type: paper # could be omitted for homogeneous graph.
num: 1000
- type: author
num: 1000
edges:
- type: author:writes:paper # could be omitted for homogeneous graph.
format: csv # Can be csv only.
path: edge_data/author-writes-paper.csv
- type: paper:cites:paper
format: csv
path: edge_data/paper-cites-paper.csv
feature_data: feature_data:
- domain: node - domain: node
type: paper type: paper # could be omitted for homogeneous graph.
name: feat name: feat
format: numpy format: numpy
in_memory: false in_memory: false # If not specified, default to true.
path: node_data/paper-feat.npy path: node_data/paper-feat.npy
- domain: edge - domain: edge
type: "author:writes:paper" type: "author:writes:paper"
...@@ -308,37 +319,35 @@ class OnDiskDataset(Dataset): ...@@ -308,37 +319,35 @@ class OnDiskDataset(Dataset):
- name: "edge_classification" - name: "edge_classification"
num_classes: 10 num_classes: 10
train_set: train_set:
- type: paper # could be null for homogeneous graph. - data: # multiple data sources could be specified.
data: # multiple data sources could be specified. - type: paper
- name: node_pairs name: node_pairs
format: numpy format: numpy # Can be numpy or torch.
in_memory: true # If not specified, default to true. in_memory: true # If not specified, default to true.
path: set/paper-train-node_pairs.npy path: set/paper-train-node_pairs.npy
- name: labels - type: paper
name: labels
format: numpy format: numpy
in_memory: false
path: set/paper-train-labels.npy path: set/paper-train-labels.npy
validation_set: validation_set:
- type: paper - data:
data: - type: paper
- name: node_pairs name: node_pairs
format: numpy format: numpy
in_memory: true
path: set/paper-validation-node_pairs.npy path: set/paper-validation-node_pairs.npy
- name: labels - type: paper
name: labels
format: numpy format: numpy
in_memory: true
path: set/paper-validation-labels.npy path: set/paper-validation-labels.npy
test_set: test_set:
- type: paper - data:
data: - type: paper
- name: node_pairs name: node_pairs
format: numpy format: numpy
in_memory: true
path: set/paper-test-node_pairs.npy path: set/paper-test-node_pairs.npy
- name: labels - type: paper
name: labels
format: numpy format: numpy
in_memory: true
path: set/paper-test-labels.npy path: set/paper-test-labels.npy
Parameters Parameters
......
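Note on the new ``graph`` section: the sketch below illustrates the two rules this change documents, namely that edge CSV files carry no header and no index column, and that every path in the YAML file is relative to the dataset directory. It uses only the Python standard library and does not call DGL. The helper names ``write_edges`` and ``build_dataset_dir`` and the dataset name ``toy_example`` are hypothetical, and whether such a reduced ``metadata.yaml`` (no ``feature_data`` or ``tasks``) is accepted by ``OnDiskDataset`` is an assumption that has not been verified here.

```python
import csv
import tempfile
from pathlib import Path


def write_edges(path, node_pairs):
    """Write node pairs as a two-column CSV with no header row and no index column."""
    path.parent.mkdir(parents=True, exist_ok=True)
    with path.open("w", newline="") as f:
        csv.writer(f).writerows(node_pairs)


def build_dataset_dir(base_dir):
    """Create a toy dataset directory and return the path of its metadata.yaml.

    Hypothetical helper: field names follow the YAML example in the docstring,
    the values are made up for illustration.
    """
    base_dir = Path(base_dir)
    # Paths are written relative to base_dir, as the updated docstring states.
    edge_files = {
        "author:writes:paper": "edge_data/author-writes-paper.csv",
        "paper:cites:paper": "edge_data/paper-cites-paper.csv",
    }
    write_edges(base_dir / edge_files["author:writes:paper"], [(0, 1), (1, 2), (2, 0)])
    write_edges(base_dir / edge_files["paper:cites:paper"], [(0, 2), (1, 0)])

    lines = [
        "dataset_name: toy_example",
        "graph:",
        "  nodes:",
        "    - type: paper",
        "      num: 3",
        "    - type: author",
        "      num: 3",
        "  edges:",
    ]
    for etype, rel_path in edge_files.items():
        lines += [
            f'    - type: "{etype}"',
            "      format: csv",
            f"      path: {rel_path}",
        ]
    metadata = base_dir / "metadata.yaml"
    metadata.write_text("\n".join(lines) + "\n")
    return metadata


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        print(build_dataset_dir(tmp).read_text())
```

The YAML text is assembled by hand to keep the sketch free of third-party dependencies; a YAML library would be the more natural choice in real preprocessing code.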