Unverified Commit 8a8d36e9 authored by Minjie Wang, committed by GitHub

[Doc] add doc for DGLCSVDataset

parents 39121dfd 6ce3c178
...@@ -18,6 +18,10 @@ Base Dataset Class
.. autoclass:: DGLDataset
    :members: download, save, load, process, has_cache, __getitem__, __len__
CSV Dataset Class
-----------------
.. autoclass:: CSVDataset
Node Prediction Datasets
---------------------------------------
......
.. _guide-data-pipeline-loadcsv:
4.6 Loading data from CSV files
----------------------------------------------
Comma Separated Value (CSV) is a widely used data storage format. DGL provides
:class:`~dgl.data.CSVDataset` for loading and parsing graph data stored in
CSV format.
To create a ``CSVDataset`` object:
.. code:: python
import dgl
ds = dgl.data.CSVDataset('/path/to/dataset')
The returned ``ds`` object is a standard :class:`~dgl.data.DGLDataset`. For
example, one can retrieve graph samples using ``__getitem__`` and access node/edge
features via ``ndata``/``edata``.
.. code:: python
# A demonstration of how to use the loaded dataset. The feature names
# may vary depending on the CSV contents.
g = ds[0] # get the graph
label = g.ndata['label']
feat = g.ndata['feat']
Data folder structure
~~~~~~~~~~~~~~~~~~~~~
.. code::
/path/to/dataset/
|-- meta.yaml # metadata of the dataset
|-- edges_0.csv # edge data including src_id, dst_id, feature, label and so on
|-- ... # you can have as many CSVs for edge data as you want
|-- nodes_0.csv # node data including node_id, feature, label and so on
|-- ... # you can have as many CSVs for node data as you want
|-- graphs.csv # graph-level features
Node-, edge- and graph-level data are stored in CSV files. ``meta.yaml`` is a metadata file specifying
where to read node/edge/graph data from and how to parse it when constructing the dataset
object. A minimal data folder contains one ``meta.yaml`` and two CSVs, one for node data and one
for edge data, in which case the dataset contains a single graph with no graph-level data.
Dataset of a single featureless graph
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
When the dataset contains only one graph with no node or edge features, the data folder needs
only three files: ``meta.yaml``, one CSV for node IDs and one CSV for edges:
.. code::
./mini_featureless_dataset/
|-- meta.yaml
|-- nodes.csv
|-- edges.csv
``meta.yaml`` contains the following information:
.. code:: yaml
dataset_name: mini_featureless_dataset
edge_data:
- file_name: edges.csv
node_data:
- file_name: nodes.csv
``nodes.csv`` lists the node IDs under the ``node_id`` field:
.. code::
node_id
0
1
2
3
4
``edges.csv`` lists all the edges in two columns (``src_id`` and ``dst_id``) specifying the
source and destination node IDs of each edge:
.. code::
src_id,dst_id
4,4
4,1
3,0
4,1
4,0
1,2
1,3
3,3
1,1
4,1
Once loaded, the dataset contains one graph without any features:
.. code:: python
>>> import dgl
>>> dataset = dgl.data.CSVDataset('./mini_featureless_dataset')
>>> g = dataset[0] # only one graph
>>> print(g)
Graph(num_nodes=5, num_edges=10,
ndata_schemes={}
edata_schemes={})
.. note::
Non-integer node IDs are allowed. When constructing the graph, ``CSVDataset`` will
map each raw ID to an integer ID starting from zero.
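For instance, a node CSV with string IDs like the following (the IDs are purely
illustrative) is valid; the three raw IDs are mapped to integer IDs starting from zero:

.. code::

   node_id
   alice
   bob
   carol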
.. note::
Edges are always directed. To have both directions, add reversed edges to the edge
CSV file or use :class:`~dgl.transforms.AddReverse` to transform the loaded graph.
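For example, a minimal sketch of adding reverse edges after loading, assuming the
``AddReverse`` transform from the ``dgl.transforms`` module:

.. code:: python

   import dgl
   import dgl.transforms as T

   dataset = dgl.data.CSVDataset('./mini_featureless_dataset')
   g = dataset[0]
   # Add a reversed copy of every edge so both directions are present.
   g = T.AddReverse()(g)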
A graph without any features is often of limited interest. The next example shows how
to load and parse node or edge features.
Dataset of a single graph with features and labels
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
When the dataset contains a single graph with node or edge features and labels, the data
folder still needs only three files: ``meta.yaml``, one CSV for nodes and one CSV
for edges:
.. code::
./mini_feature_dataset/
|-- meta.yaml
|-- nodes.csv
|-- edges.csv
``meta.yaml``:
.. code:: yaml
dataset_name: mini_feature_dataset
edge_data:
- file_name: edges.csv
node_data:
- file_name: nodes.csv
``edges.csv``, with five synthetic data columns per edge (``label``, ``train_mask``, ``val_mask``, ``test_mask``, ``feat``):
.. code::
src_id,dst_id,label,train_mask,val_mask,test_mask,feat
4,0,2,False,True,True,"0.5477868606453535, 0.4470617033458436, 0.936706701616337"
4,0,0,False,False,True,"0.9794634290792008, 0.23682038840665198, 0.049629338970987646"
0,3,1,True,True,True,"0.8586722047523594, 0.5746912787380253, 0.6462162561249654"
0,1,2,True,False,False,"0.2730008213674695, 0.5937484188166621, 0.765544096939567"
0,2,1,True,True,True,"0.45441619816038514, 0.1681403185591509, 0.9952376085297715"
0,0,0,False,False,False,"0.4197669213305396, 0.849983324532477, 0.16974127573016262"
2,2,1,False,True,True,"0.5495035052928215, 0.21394654203489705, 0.7174910641836348"
1,0,2,False,True,False,"0.008790817766266334, 0.4216530595907526, 0.529195480661293"
3,0,0,True,True,True,"0.6598715708878852, 0.1932390907048961, 0.9774471538377553"
4,0,1,False,False,False,"0.16846068931179736, 0.41516080644186737, 0.002158116134429955"
``nodes.csv``, with five synthetic data columns per node (``label``, ``train_mask``, ``val_mask``, ``test_mask``, ``feat``):
.. code::
node_id,label,train_mask,val_mask,test_mask,feat
0,1,False,True,True,"0.07816474278491703, 0.9137336384979067, 0.4654086994009452"
1,1,True,True,True,"0.05354099924658973, 0.8753101998792645, 0.33929432608774135"
2,1,True,False,True,"0.33234211884156384, 0.9370522452510665, 0.6694943496824788"
3,0,False,True,False,"0.9784264442230887, 0.22131880861864428, 0.3161154827254189"
4,1,True,True,False,"0.23142237259162102, 0.8715767748481147, 0.19117861103555467"
Once loaded, the dataset contains one graph. Node/edge features are stored in ``ndata`` and ``edata``
under the same column names. The example also demonstrates how to specify a vector-shaped feature
using a comma-separated list enclosed in double quotes ``"..."``.
.. code:: python
>>> import dgl
>>> dataset = dgl.data.CSVDataset('./mini_feature_dataset')
>>> g = dataset[0] # only one graph
>>> print(g)
Graph(num_nodes=5, num_edges=10,
ndata_schemes={'label': Scheme(shape=(), dtype=torch.int64), 'train_mask': Scheme(shape=(), dtype=torch.bool), 'val_mask': Scheme(shape=(), dtype=torch.bool), 'test_mask': Scheme(shape=(), dtype=torch.bool), 'feat': Scheme(shape=(3,), dtype=torch.float64)}
edata_schemes={'label': Scheme(shape=(), dtype=torch.int64), 'train_mask': Scheme(shape=(), dtype=torch.bool), 'val_mask': Scheme(shape=(), dtype=torch.bool), 'test_mask': Scheme(shape=(), dtype=torch.bool), 'feat': Scheme(shape=(3,), dtype=torch.float64)})
.. note::
By default, ``CSVDataset`` assumes all feature data are numerical values (e.g., int, float, bool or
list) and missing values are not allowed. Users can provide a custom data parser for such cases.
See `Custom Data Parser`_ for more details.
Dataset of a single heterogeneous graph
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
One can specify multiple node and edge CSV files (one per type) to represent a heterogeneous graph.
Here is an example dataset with two node types and two edge types:
.. code::
./mini_hetero_dataset/
|-- meta.yaml
|-- nodes_0.csv
|-- nodes_1.csv
|-- edges_0.csv
|-- edges_1.csv
The ``meta.yaml`` specifies the node type name (using ``ntype``) and edge type name (using ``etype``)
of each CSV file. The edge type name is a string triplet containing the source node type name, the
relation name and the destination node type name.
.. code:: yaml
dataset_name: mini_hetero_dataset
edge_data:
- file_name: edges_0.csv
etype: [user, follow, user]
- file_name: edges_1.csv
etype: [user, like, item]
node_data:
- file_name: nodes_0.csv
ntype: user
- file_name: nodes_1.csv
ntype: item
The node and edge CSV files follow the same format as in homogeneous graphs. Here are some synthetic
data for demonstration purposes:
``edges_0.csv`` and ``edges_1.csv``:
.. code::
src_id,dst_id,label,feat
4,4,1,"0.736833152378035,0.10522806046048205,0.9418796835016118"
3,4,2,"0.5749339182767451,0.20181320245665535,0.490938012147181"
1,4,2,"0.7697294432580938,0.49397782380750765,0.10864079337442234"
0,4,0,"0.1364240150959487,0.1393107840629273,0.7901988878812207"
2,3,1,"0.42988138237505735,0.18389137408509248,0.18431292077750894"
0,4,2,"0.8613368738351794,0.67985810014162,0.6580438064356824"
2,4,1,"0.6594951663841697,0.26499036865016423,0.7891429392727503"
4,1,0,"0.36649684241348557,0.9511783938523962,0.8494919263589972"
1,1,2,"0.698592283371875,0.038622249776255946,0.5563827995742111"
0,4,1,"0.5227112950269823,0.3148264185956532,0.47562693094002173"
``nodes_0.csv`` and ``nodes_1.csv``:
.. code::
node_id,label,feat
0,2,"0.5400687466285844,0.7588441197954202,0.4268254673041745"
1,1,"0.08680051341900807,0.11446843700743892,0.7196969604886617"
2,2,"0.8964389655603473,0.23368113896545695,0.8813472954005022"
3,1,"0.5454703921677284,0.7819383771535038,0.3027939452162367"
4,1,"0.5365210052235699,0.8975240205792763,0.7613943085507672"
Once loaded, the dataset contains one heterogeneous graph with features and labels:
.. code:: python
>>> import dgl
>>> dataset = dgl.data.CSVDataset('./mini_hetero_dataset')
>>> g = dataset[0] # only one graph
>>> print(g)
Graph(num_nodes={'item': 5, 'user': 5},
num_edges={('user', 'follow', 'user'): 10, ('user', 'like', 'item'): 10},
metagraph=[('user', 'user', 'follow'), ('user', 'item', 'like')])
>>> g.nodes['user'].data
{'label': tensor([2, 1, 2, 1, 1]), 'feat': tensor([[0.5401, 0.7588, 0.4268],
[0.0868, 0.1145, 0.7197],
[0.8964, 0.2337, 0.8813],
[0.5455, 0.7819, 0.3028],
[0.5365, 0.8975, 0.7614]], dtype=torch.float64)}
>>> g.edges['like'].data
{'label': tensor([1, 2, 2, 0, 1, 2, 1, 0, 2, 1]), 'feat': tensor([[0.7368, 0.1052, 0.9419],
[0.5749, 0.2018, 0.4909],
[0.7697, 0.4940, 0.1086],
[0.1364, 0.1393, 0.7902],
[0.4299, 0.1839, 0.1843],
[0.8613, 0.6799, 0.6580],
[0.6595, 0.2650, 0.7891],
[0.3665, 0.9512, 0.8495],
[0.6986, 0.0386, 0.5564],
[0.5227, 0.3148, 0.4756]], dtype=torch.float64)}
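When a relation name alone is ambiguous (for example, when two edge types share the same
relation name), the canonical string triplet can be used as the key instead; a minimal sketch:

.. code:: python

   # Equivalent to g.edges['like'].data here, since the relation
   # name 'like' is unambiguous in this graph.
   feat = g.edges[('user', 'like', 'item')].data['feat']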
Dataset of multiple graphs
~~~~~~~~~~~~~~~~~~~~~~~~~~
When there are multiple graphs, one can include an additional CSV file for storing graph-level features.
Here is an example:
.. code::
./mini_multi_dataset/
|-- meta.yaml
|-- nodes.csv
|-- edges.csv
|-- graphs.csv
Accordingly, the ``meta.yaml`` should include an extra ``graph_data`` key specifying which CSV file
to load graph-level features from.
.. code:: yaml
dataset_name: mini_multi_dataset
edge_data:
- file_name: edges.csv
node_data:
- file_name: nodes.csv
graph_data:
  file_name: graphs.csv
To distinguish nodes and edges of different graphs, ``nodes.csv`` and ``edges.csv`` must contain
an extra column ``graph_id``:
``edges.csv``:
.. code::
graph_id,src_id,dst_id,feat
0,0,4,"0.39534097273254654,0.9422093637539785,0.634899790318452"
0,3,0,"0.04486384200747007,0.6453746567017163,0.8757520744192612"
0,3,2,"0.9397636966928355,0.6526403892728874,0.8643238446466464"
0,1,1,"0.40559906615287566,0.9848072295736628,0.493888090726854"
0,4,1,"0.253458867276219,0.9168191778828504,0.47224962583565544"
0,0,1,"0.3219496197945605,0.3439899477636117,0.7051530741717352"
0,2,1,"0.692873149428549,0.4770019763881086,0.21937428942781778"
0,4,0,"0.620118223673067,0.08691420300562658,0.86573472329756"
0,2,1,"0.00743445923710373,0.5251800239734318,0.054016385555202384"
0,4,1,"0.6776417760682221,0.7291568018841328,0.4523600060547709"
1,1,3,"0.6375445528248924,0.04878384701995819,0.4081642382536248"
1,0,4,"0.776002616178397,0.8851294998284638,0.7321742043493028"
1,1,0,"0.0928555079874982,0.6156748364694707,0.6985674921582508"
1,0,2,"0.31328748118329997,0.8326121496142408,0.04133991340612775"
1,1,0,"0.36786902637778773,0.39161865931662243,0.9971749359397111"
1,1,1,"0.4647410679872376,0.8478810655406659,0.6746269314422184"
1,0,2,"0.8117650553546695,0.7893727601272978,0.41527155506593394"
1,1,3,"0.40707309111756307,0.2796588354307046,0.34846782265758314"
1,1,0,"0.18626464175355095,0.3523777809254057,0.7863421810531344"
1,3,0,"0.28357022069634585,0.13774964202156292,0.5913335505943637"
``nodes.csv``:
.. code::
graph_id,node_id,feat
0,0,"0.5725330322207948,0.8451870383322376,0.44412796119211184"
0,1,"0.6624186423087752,0.6118386331195641,0.7352138669985214"
0,2,"0.7583372765843964,0.15218126307872892,0.6810484348765842"
0,3,"0.14627522432017592,0.7457985352827006,0.1037097085190507"
0,4,"0.49037522512771525,0.8778998699783784,0.0911194482288028"
1,0,"0.11158102039672668,0.08543289788089736,0.6901745368284345"
1,1,"0.28367647637469273,0.07502571020414439,0.01217200152200748"
1,2,"0.2472495901894738,0.24285506608575758,0.6494437360242048"
1,3,"0.5614197853127827,0.059172654879085296,0.4692371689047904"
1,4,"0.17583413999295983,0.5191278830882644,0.8453123358491914"
``graphs.csv`` contains a ``graph_id`` column and an arbitrary number of feature columns.
The example dataset here has two graphs, each with graph-level ``feat`` and ``label``
data.
.. code::
graph_id,feat,label
0,"0.7426272601929126,0.5197462471155317,0.8149104951283953",0
1,"0.534822233529295,0.2863627767733977,0.1154897249106891",0
Once loaded, the dataset contains multiple homogeneous graphs with features and labels:
.. code:: python
>>> import dgl
>>> dataset = dgl.data.CSVDataset('./mini_multi_dataset')
>>> print(len(dataset))
2
>>> graph0, data0 = dataset[0]
>>> print(graph0)
Graph(num_nodes=5, num_edges=10,
ndata_schemes={'feat': Scheme(shape=(3,), dtype=torch.float64)}
edata_schemes={'feat': Scheme(shape=(3,), dtype=torch.float64)})
>>> print(data0)
{'feat': tensor([0.7426, 0.5197, 0.8149], dtype=torch.float64), 'label': tensor(0)}
>>> graph1, data1 = dataset[1]
>>> print(graph1)
Graph(num_nodes=5, num_edges=10,
ndata_schemes={'feat': Scheme(shape=(3,), dtype=torch.float64)}
edata_schemes={'feat': Scheme(shape=(3,), dtype=torch.float64)})
>>> print(data1)
{'feat': tensor([0.5348, 0.2864, 0.1155], dtype=torch.float64), 'label': tensor(0)}
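A multi-graph dataset like this can be fed directly to a batching data loader for graph-level
tasks. Below is a minimal sketch, assuming :class:`~dgl.dataloading.GraphDataLoader` with its
default collate function (which batches the graphs and collates the graph-level data dicts):

.. code:: python

   from dgl.dataloading import GraphDataLoader

   dataloader = GraphDataLoader(dataset, batch_size=2, shuffle=True)
   for batched_graph, data in dataloader:
       # ``batched_graph`` merges all graphs in the batch into one DGLGraph;
       # ``data`` is a dict of stacked graph-level tensors, e.g. ``data['label']``.
       print(batched_graph.batch_size, data['label'])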
Custom Data Parser
~~~~~~~~~~~~~~~~~~
By default, ``CSVDataset`` assumes that all the stored node-/edge-/graph-level data are numerical
values. Users can provide a custom ``DataParser`` to ``CSVDataset`` to handle more complex
data types. A ``DataParser`` needs to implement the ``__call__`` method, which takes the
:class:`pandas.DataFrame` object created from a CSV file and returns a dictionary of
parsed feature data. The parsed feature data will be saved to the ``ndata`` and ``edata`` of
the corresponding ``DGLGraph`` object, and thus must be tensors or NumPy arrays. Below is an
example ``DataParser`` that converts string labels to integers.
Given a dataset as follows,
.. code::
./customized_parser_dataset/
|-- meta.yaml
|-- nodes.csv
|-- edges.csv
``meta.yaml``:
.. code:: yaml
dataset_name: customized_parser_dataset
edge_data:
- file_name: edges.csv
node_data:
- file_name: nodes.csv
``edges.csv``:
.. code::
src_id,dst_id,label
4,0,positive
4,0,negative
0,3,positive
0,1,positive
0,2,negative
0,0,positive
2,2,negative
1,0,positive
3,0,negative
4,0,positive
``nodes.csv``:
.. code::
node_id,label
0,positive
1,negative
2,positive
3,negative
4,positive
To parse the string labels, one can define a ``DataParser`` class as follows:
.. code:: python
import numpy as np
import pandas as pd

class MyDataParser:
    def __call__(self, df: pd.DataFrame):
        parsed = {}
        for header in df:
            # pandas names columns without a header 'Unnamed: N'; skip them.
            if 'Unnamed' in header:
                print("Unnamed column found. Ignored...")
                continue
            dt = df[header].to_numpy().squeeze()
            if header == 'label':
                # Map string labels to integers: 'positive' -> 1, others -> 0.
                dt = np.array([1 if e == 'positive' else 0 for e in dt])
            parsed[header] = dt
        return parsed
Create a ``CSVDataset`` using the defined ``DataParser``:
.. code:: python
>>> import dgl
>>> dataset = dgl.data.CSVDataset('./customized_parser_dataset',
... ndata_parser=MyDataParser(),
... edata_parser=MyDataParser())
>>> print(dataset[0].ndata['label'])
tensor([1, 0, 1, 0, 1])
>>> print(dataset[0].edata['label'])
tensor([1, 0, 1, 1, 0, 1, 0, 1, 0, 1])
.. note::
To specify different ``DataParser``\s for different node/edge types, pass a dictionary to
``ndata_parser`` and ``edata_parser``, where the key is the type name (a single string for
a node type; a string triplet for an edge type) and the value is the ``DataParser`` to use.
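For instance, a hypothetical sketch for the heterogeneous dataset shown earlier, assuming one
parser per type (``MyDataParser`` is the class defined above; the parsers here are for
illustration of the API shape only):

.. code:: python

   dataset = dgl.data.CSVDataset(
       './mini_hetero_dataset',
       ndata_parser={'user': MyDataParser(), 'item': MyDataParser()},
       edata_parser={('user', 'follow', 'user'): MyDataParser(),
                     ('user', 'like', 'item'): MyDataParser()})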
Full YAML Specification
~~~~~~~~~~~~~~~~~~~~~~~
``CSVDataset`` allows more flexible control over the loading and parsing process. For example, one
can change the ID column names via ``meta.yaml``. The example below lists all the supported keys.
.. code:: yaml
version: 1.0.0
dataset_name: some_complex_data
separator: ',' # CSV separator symbol. Default: ','
edge_data:
- file_name: edges_0.csv
etype: [user, follow, user]
src_id_field: src_id # Column name for source node IDs. Default: src_id
dst_id_field: dst_id # Column name for destination node IDs. Default: dst_id
- file_name: edges_1.csv
etype: [user, like, item]
src_id_field: src_id
dst_id_field: dst_id
node_data:
- file_name: nodes_0.csv
ntype: user
node_id_field: node_id # Column name for node IDs. Default: node_id
- file_name: nodes_1.csv
ntype: item
node_id_field: node_id # Column name for node IDs. Default: node_id
graph_data:
file_name: graphs.csv
graph_id_field: graph_id # Column name for graph IDs. Default: graph_id
Top-level
^^^^^^^^^^^^^^
At the top level, only 6 keys are available:
- ``version``: Optional. String.
It specifies which version of ``meta.yaml`` is used. More features may be added in the future.
- ``dataset_name``: Required. String.
It specifies the dataset name.
- ``separator``: Optional. String.
It specifies how to parse data in CSV files. Default: ``','``.
- ``edge_data``: Required. List of ``EdgeData``.
Metadata for parsing edge CSV files.
- ``node_data``: Required. List of ``NodeData``.
Metadata for parsing node CSV files.
- ``graph_data``: Optional. ``GraphData``.
Metadata for parsing the graph CSV file.
``EdgeData``
^^^^^^^^^^^^^^^^^^^^^^
There are 4 keys:
- ``file_name``: Required. String.
The CSV file to load data from.
- ``etype``: Optional. List of strings.
Edge type name as a string triplet: [source node type, relation type, destination node type].
- ``src_id_field``: Optional. String.
Which column to read for source node IDs. Default: ``src_id``.
- ``dst_id_field``: Optional. String.
Which column to read for destination node IDs. Default: ``dst_id``.
``NodeData``
^^^^^^^^^^^^^^^^^^^^^^
There are 3 keys:
- ``file_name``: Required. String.
The CSV file to load data from.
- ``ntype``: Optional. String.
Node type name.
- ``node_id_field``: Optional. String.
Which column to read for node IDs. Default: ``node_id``.
``GraphData``
^^^^^^^^^^^^^^^^^^^^^^
There are 2 keys:
- ``file_name``: Required. String.
The CSV file to load data from.
- ``graph_id_field``: Optional. String.
Which column to read for graph IDs. Default: ``graph_id``.
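Since ``CSVDataset`` is a standard :class:`~dgl.data.DGLDataset`, the parsed graphs are cached
after the first load. A minimal sketch, assuming the ``force_reload`` flag inherited from the
base class triggers a fresh parse of the CSV files:

.. code:: python

   import dgl

   # Re-parse the CSV files even if a processed cache exists (assumption:
   # ``force_reload`` behaves as in the DGLDataset base class).
   ds = dgl.data.CSVDataset('/path/to/dataset', force_reload=True)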
\ No newline at end of file
...@@ -23,6 +23,7 @@ shows how to implement each component of it.
* :ref:`guide-data-pipeline-process`
* :ref:`guide-data-pipeline-savenload`
* :ref:`guide-data-pipeline-loadogb`
* :ref:`guide-data-pipeline-loadogb` * :ref:`guide-data-pipeline-loadogb`
* :ref:`guide-data-pipeline-loadcsv`
.. toctree::
    :maxdepth: 1
...@@ -34,3 +35,4 @@ shows how to implement each component of it.
    data-process
    data-savenload
    data-loadogb
    data-loadcsv
\ No newline at end of file
...@@ -6,8 +6,7 @@ from ..base import DGLError
class CSVDataset(DGLDataset):
-    """ This class aims to parse data from CSV files, construct DGLGraph
-    and behaves as a DGLDataset.
+    """Dataset class that loads and parses graph data from CSV files.
    Parameters
    ----------
...@@ -51,7 +50,9 @@ class CSVDataset(DGLDataset):
    any available graph-level data such as graph-level feature, labels.

    Examples
-    [TODO]: link to a detailed web page.
+    --------
+    Please refer to :ref:`guide-data-pipeline-loadcsv`.
    """
    META_YAML_NAME = 'meta.yaml'
......
...@@ -222,6 +222,16 @@ dataset = SyntheticDataset()
graph, label = dataset[0]
print(graph, label)
######################################################################
# Creating a Dataset from CSV via :class:`~dgl.data.CSVDataset`
# ------------------------------------------------------------
#
# The previous examples describe how to create a dataset from CSV files
# step-by-step. DGL also provides a utility class :class:`~dgl.data.CSVDataset`
# for reading and parsing data from CSV files. See :ref:`guide-data-pipeline-loadcsv`
# for more details.
#
# Thumbnail credits: (Un)common Use Cases for Graph Databases, Michal Bachman
# sphinx_gallery_thumbnail_path = '_static/blitz_6_load_data.png'