Commit 3e59b1d4 authored by Rhett Ying, committed by GitHub

[doc] update OnDiskDataset (#6811)

parent 7094ff4f
.. _stochastic_training-ondisk-dataset:
Creating OnDiskDataset
======================
Composing OnDiskDataset from raw data
=====================================
This tutorial shows how to create an `OnDiskDataset` from raw data and use it
for stochastic training.
This tutorial shows how to compose :class:`~dgl.graphbolt.OnDiskDataset` from
raw data. A full specification of ``metadata.yaml`` is also provided.
**GraphBolt** provides the ``OnDiskDataset`` class to help users organize plain
data of graph structure, feature data and tasks. ``OnDiskDataset`` is also
designed to efficiently handle large graphs and features that do not fit into
memory by storing them on disk.
For more details about `OnDiskDataset`, please refer to the
:class:`~dgl.graphbolt.OnDiskDataset` API documentation.
.. toctree::
:maxdepth: 1
:glob:
......
......@@ -25,6 +25,7 @@
"This tutorial shows how to create an `OnDiskDataset` for a heterogeneous graph that can be used in the **GraphBolt** framework. The major difference from creating a dataset for a homogeneous graph is that we need to specify node/edge types for edges, feature data, and training/validation/test sets.\n",
"\n",
"By the end of this tutorial, you will be able to\n",
"\n",
"- organize graph structure data.\n",
"- organize feature data.\n",
"- organize training/validation/test set for specific tasks.\n",
......@@ -104,7 +105,7 @@
"### Generate graph structure data\n",
"For a heterogeneous graph, we need to save edges of different types (namely node pairs) into separate **CSV** files.\n",
"\n",
"Note:\n",
"**Note**:\n",
"when saving to file, do not save index and header.\n"
],
"metadata": {
......
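The note in the hunk above (one CSV file per edge type, written without index or header) can be sketched as follows. This is a minimal illustration rather than the notebook's own code: it uses Python's standard `csv` module, and the toy edge lists and the `save_edges` helper are hypothetical names introduced here. The file naming mirrors the `edge_data/author-writes-paper.csv` convention that appears in the YAML example of this commit.

```python
import csv
import tempfile
from pathlib import Path

# Hypothetical toy edges, keyed by edge type ("src_type:relation:dst_type").
edges_by_type = {
    "author:writes:paper": [(0, 1), (1, 2), (2, 0)],
    "paper:cites:paper": [(0, 2), (1, 0)],
}


def save_edges(base_dir, edges_by_type):
    """Write one CSV per edge type and return paths relative to base_dir."""
    edge_dir = Path(base_dir) / "edge_data"
    edge_dir.mkdir(parents=True, exist_ok=True)
    written = {}
    for etype, pairs in edges_by_type.items():
        # "author:writes:paper" -> "author-writes-paper.csv"
        path = edge_dir / (etype.replace(":", "-") + ".csv")
        with open(path, "w", newline="") as f:
            # Each row is a bare "src,dst" pair: no header row, no index column.
            csv.writer(f).writerows(pairs)
        written[etype] = path.relative_to(base_dir).as_posix()
    return written


base_dir = tempfile.mkdtemp()
edge_paths = save_edges(base_dir, edges_by_type)
first_line = (
    (Path(base_dir) / edge_paths["author:writes:paper"]).read_text().splitlines()[0]
)
```

The returned relative paths are the values one would place under `path:` for each edge entry in `metadata.yaml`, since paths there are relative to the dataset directory.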
......@@ -25,6 +25,7 @@
"This tutorial shows how to create an `OnDiskDataset` for a homogeneous graph that can be used in the **GraphBolt** framework.\n",
"\n",
"By the end of this tutorial, you will be able to\n",
"\n",
"- organize graph structure data.\n",
"- organize feature data.\n",
"- organize training/validation/test set for specific tasks.\n",
......
......@@ -281,22 +281,33 @@ class OnDiskDataset(Dataset):
Due to limited resources, the data which are too large to fit into RAM will
remain on disk while others reside in RAM once ``OnDiskDataset`` is
initialized. This behavior can be controlled by the user via the ``in_memory``
field in YAML file.
field in YAML file. All paths in YAML file are relative paths to the
dataset directory.
A full example of YAML file is as follows:
.. code-block:: yaml
dataset_name: graphbolt_test
graph_topology:
type: FusedCSCSamplingGraph
path: graph_topology/fused_csc_sampling_graph.tar
graph:
nodes:
- type: paper # could be omitted for homogeneous graph.
num: 1000
- type: author
num: 1000
edges:
- type: author:writes:paper # could be omitted for homogeneous graph.
format: csv # Can be csv only.
path: edge_data/author-writes-paper.csv
- type: paper:cites:paper
format: csv
path: edge_data/paper-cites-paper.csv
feature_data:
- domain: node
type: paper
type: paper # could be omitted for homogeneous graph.
name: feat
format: numpy
in_memory: false
in_memory: false # If not specified, default to true.
path: node_data/paper-feat.npy
- domain: edge
type: "author:writes:paper"
......@@ -308,37 +319,35 @@ class OnDiskDataset(Dataset):
- name: "edge_classification"
num_classes: 10
train_set:
- type: paper # could be null for homogeneous graph.
data: # multiple data sources could be specified.
- name: node_pairs
format: numpy
- data: # multiple data sources could be specified.
- type: paper
name: node_pairs
format: numpy # Can be numpy or torch.
in_memory: true # If not specified, default to true.
path: set/paper-train-node_pairs.npy
- name: labels
- type: paper
name: labels
format: numpy
in_memory: false
path: set/paper-train-labels.npy
validation_set:
- data:
- type: paper
data:
- name: node_pairs
name: node_pairs
format: numpy
in_memory: true
path: set/paper-validation-node_pairs.npy
- name: labels
- type: paper
name: labels
format: numpy
in_memory: true
path: set/paper-validation-labels.npy
test_set:
- data:
- type: paper
data:
- name: node_pairs
name: node_pairs
format: numpy
in_memory: true
path: set/paper-test-node_pairs.npy
- name: labels
- type: paper
name: labels
format: numpy
in_memory: true
path: set/paper-test-labels.npy
Parameters
......
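Two rules stated in the updated docstring are that every path in the YAML file is relative to the dataset directory, and that `in_memory` defaults to true when omitted. The sketch below illustrates both rules on an already-parsed fragment of the metadata. It is not GraphBolt's implementation: `resolve_features`, the dataset directory `/data/graphbolt_test`, and the second (`author`) feature entry are hypothetical and exist only for illustration.

```python
from pathlib import Path

# Hypothetical, already-parsed fragment of metadata.yaml (plain dicts and lists).
metadata = {
    "feature_data": [
        {
            "domain": "node",
            "type": "paper",
            "name": "feat",
            "format": "numpy",
            "in_memory": False,
            "path": "node_data/paper-feat.npy",
        },
        {
            # Hypothetical entry with in_memory omitted.
            "domain": "node",
            "type": "author",
            "name": "feat",
            "format": "numpy",
            "path": "node_data/author-feat.npy",
        },
    ]
}


def resolve_features(dataset_dir, metadata):
    """Resolve relative paths and apply the documented in_memory default."""
    resolved = []
    for entry in metadata["feature_data"]:
        resolved.append(
            {
                # "type" may be absent for a homogeneous graph.
                "key": (entry["domain"], entry.get("type"), entry["name"]),
                # Paths in the YAML file are relative to the dataset directory.
                "path": (Path(dataset_dir) / entry["path"]).as_posix(),
                # If not specified, in_memory defaults to true.
                "in_memory": entry.get("in_memory", True),
            }
        )
    return resolved


features = resolve_features("/data/graphbolt_test", metadata)
```

Here the `paper` feature stays on disk because it sets `in_memory: false`, while the entry that omits the field is treated as in-memory.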