Unverified Commit 8a8d36e9 authored by Minjie Wang, committed by GitHub

[Doc] add doc for DGLCSVDataset

parents 39121dfd 6ce3c178
...@@ -18,6 +18,10 @@ Base Dataset Class
.. autoclass:: DGLDataset
    :members: download, save, load, process, has_cache, __getitem__, __len__
CSV Dataset Class
-----------------
.. autoclass:: CSVDataset
Node Prediction Datasets
---------------------------------------
......
.. _guide-data-pipeline-loadcsv:
4.6 Loading data from CSV files
----------------------------------------------
Comma Separated Value (CSV) is a widely used data storage format. DGL provides
:class:`~dgl.data.CSVDataset` for loading and parsing graph data stored in
CSV format.
To create a ``CSVDataset`` object:
.. code:: python
import dgl
ds = dgl.data.CSVDataset('/path/to/dataset')
The returned ``ds`` object is a standard :class:`~dgl.data.DGLDataset`. For
example, one can retrieve graph samples using ``__getitem__`` and access node/edge
features via ``ndata``/``edata``.
.. code:: python
# A demonstration of how to use the loaded dataset. The feature names
# may vary depending on the CSV contents.
g = ds[0] # get the graph
label = g.ndata['label']
feat = g.ndata['feat']
Data folder structure
~~~~~~~~~~~~~~~~~~~~~
.. code::
/path/to/dataset/
|-- meta.yaml # metadata of the dataset
|-- edges_0.csv # edge data including src_id, dst_id, feature, label and so on
|-- ... # you can have as many CSVs for edge data as you want
|-- nodes_0.csv # node data including node_id, feature, label and so on
|-- ... # you can have as many CSVs for node data as you want
|-- graphs.csv # graph-level features
Node-, edge- and graph-level data are stored in CSV files. ``meta.yaml`` is a metadata file specifying
where to read node/edge/graph data from and how to parse it when constructing the dataset
object. A minimal data folder contains one ``meta.yaml`` and two CSVs, one for node data and one
for edge data, in which case the dataset contains a single graph with no graph-level data.
Dataset of a single featureless graph
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
When the dataset contains only one graph with no node or edge features, the data folder needs
only three files: ``meta.yaml``, one CSV for node IDs and one CSV for edges:
.. code::
./mini_featureless_dataset/
|-- meta.yaml
|-- nodes.csv
|-- edges.csv
``meta.yaml`` contains the following information:
.. code:: yaml
dataset_name: mini_featureless_dataset
edge_data:
- file_name: edges.csv
node_data:
- file_name: nodes.csv
``nodes.csv`` lists the node IDs under the ``node_id`` field:
.. code::
node_id
0
1
2
3
4
``edges.csv`` lists all the edges in two columns (``src_id`` and ``dst_id``) specifying the
source and destination node IDs of each edge:
.. code::
src_id,dst_id
4,4
4,1
3,0
4,1
4,0
1,2
1,3
3,3
1,1
4,1
Once loaded, the dataset contains one graph without any features:
.. code:: python
>>> import dgl
>>> dataset = dgl.data.CSVDataset('./mini_featureless_dataset')
>>> g = dataset[0] # only one graph
>>> print(g)
Graph(num_nodes=5, num_edges=10,
ndata_schemes={}
edata_schemes={})
.. note::
Non-integer node IDs are allowed. When constructing the graph, ``CSVDataset`` will
map each raw ID to an integer ID starting from zero.
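For instance, a node CSV with string IDs like the following (the IDs are purely
illustrative) is valid; the three raw IDs are mapped to integer IDs starting from zero:

.. code::

   node_id
   alice
   bob
   carol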
.. note::
Edges are always directed. To have both directions, add reversed edges to the edge
CSV file or use :class:`~dgl.transforms.AddReverse` to transform the loaded graph.
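For example, a minimal sketch of adding reverse edges after loading, assuming the
``AddReverse`` transform from the ``dgl.transforms`` module:

.. code:: python

   import dgl
   import dgl.transforms as T

   dataset = dgl.data.CSVDataset('./mini_featureless_dataset')
   g = dataset[0]
   # Add a reversed copy of every edge so both directions are present.
   g = T.AddReverse()(g)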
A graph without any features is often of limited interest. The next example shows how
to load and parse node or edge features.
Dataset of a single graph with features and labels
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
When the dataset contains a single graph with node or edge features and labels, the data
folder still needs only three files: ``meta.yaml``, one CSV for nodes and one CSV
for edges:
.. code::
./mini_feature_dataset/
|-- meta.yaml
|-- nodes.csv
|-- edges.csv
``meta.yaml``:
.. code:: yaml
dataset_name: mini_feature_dataset
edge_data:
- file_name: edges.csv
node_data:
- file_name: nodes.csv
``edges.csv``, with five synthetic data columns per edge (``label``, ``train_mask``, ``val_mask``, ``test_mask``, ``feat``):
.. code::
src_id,dst_id,label,train_mask,val_mask,test_mask,feat
4,0,2,False,True,True,"0.5477868606453535, 0.4470617033458436, 0.936706701616337"
4,0,0,False,False,True,"0.9794634290792008, 0.23682038840665198, 0.049629338970987646"
0,3,1,True,True,True,"0.8586722047523594, 0.5746912787380253, 0.6462162561249654"
0,1,2,True,False,False,"0.2730008213674695, 0.5937484188166621, 0.765544096939567"
0,2,1,True,True,True,"0.45441619816038514, 0.1681403185591509, 0.9952376085297715"
0,0,0,False,False,False,"0.4197669213305396, 0.849983324532477, 0.16974127573016262"
2,2,1,False,True,True,"0.5495035052928215, 0.21394654203489705, 0.7174910641836348"
1,0,2,False,True,False,"0.008790817766266334, 0.4216530595907526, 0.529195480661293"
3,0,0,True,True,True,"0.6598715708878852, 0.1932390907048961, 0.9774471538377553"
4,0,1,False,False,False,"0.16846068931179736, 0.41516080644186737, 0.002158116134429955"
``nodes.csv``, with five synthetic data columns per node (``label``, ``train_mask``, ``val_mask``, ``test_mask``, ``feat``):
.. code::
node_id,label,train_mask,val_mask,test_mask,feat
0,1,False,True,True,"0.07816474278491703, 0.9137336384979067, 0.4654086994009452"
1,1,True,True,True,"0.05354099924658973, 0.8753101998792645, 0.33929432608774135"
2,1,True,False,True,"0.33234211884156384, 0.9370522452510665, 0.6694943496824788"
3,0,False,True,False,"0.9784264442230887, 0.22131880861864428, 0.3161154827254189"
4,1,True,True,False,"0.23142237259162102, 0.8715767748481147, 0.19117861103555467"
Once loaded, the dataset contains one graph. Node/edge features are stored in ``ndata`` and ``edata``
under the same column names. The example also demonstrates how to specify a vector-shaped feature
using a comma-separated list enclosed in double quotes ``"..."``.
.. code:: python
>>> import dgl
>>> dataset = dgl.data.CSVDataset('./mini_feature_dataset')
>>> g = dataset[0] # only one graph
>>> print(g)
Graph(num_nodes=5, num_edges=10,
ndata_schemes={'label': Scheme(shape=(), dtype=torch.int64), 'train_mask': Scheme(shape=(), dtype=torch.bool), 'val_mask': Scheme(shape=(), dtype=torch.bool), 'test_mask': Scheme(shape=(), dtype=torch.bool), 'feat': Scheme(shape=(3,), dtype=torch.float64)}
edata_schemes={'label': Scheme(shape=(), dtype=torch.int64), 'train_mask': Scheme(shape=(), dtype=torch.bool), 'val_mask': Scheme(shape=(), dtype=torch.bool), 'test_mask': Scheme(shape=(), dtype=torch.bool), 'feat': Scheme(shape=(3,), dtype=torch.float64)})
.. note::
By default, ``CSVDataset`` assumes all feature data are numerical values (e.g., int, float, bool or
list) and missing values are not allowed. Users can provide a custom data parser for such cases.
See `Custom Data Parser`_ for more details.
Dataset of a single heterogeneous graph
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
One can specify multiple node and edge CSV files (one per type) to represent a heterogeneous graph.
Here is an example dataset with two node types and two edge types:
.. code::
./mini_hetero_dataset/
|-- meta.yaml
|-- nodes_0.csv
|-- nodes_1.csv
|-- edges_0.csv
|-- edges_1.csv
The ``meta.yaml`` specifies the node type name (using ``ntype``) and edge type name (using ``etype``)
of each CSV file. The edge type name is a string triplet containing the source node type name, the
relation name and the destination node type name.
.. code:: yaml
dataset_name: mini_hetero_dataset
edge_data:
- file_name: edges_0.csv
etype: [user, follow, user]
- file_name: edges_1.csv
etype: [user, like, item]
node_data:
- file_name: nodes_0.csv
ntype: user
- file_name: nodes_1.csv
ntype: item
The node and edge CSV files follow the same format as in homogeneous graphs. Here are some synthetic
data for demonstration purposes:
``edges_0.csv`` and ``edges_1.csv``:
.. code::
src_id,dst_id,label,feat
4,4,1,"0.736833152378035,0.10522806046048205,0.9418796835016118"
3,4,2,"0.5749339182767451,0.20181320245665535,0.490938012147181"
1,4,2,"0.7697294432580938,0.49397782380750765,0.10864079337442234"
0,4,0,"0.1364240150959487,0.1393107840629273,0.7901988878812207"
2,3,1,"0.42988138237505735,0.18389137408509248,0.18431292077750894"
0,4,2,"0.8613368738351794,0.67985810014162,0.6580438064356824"
2,4,1,"0.6594951663841697,0.26499036865016423,0.7891429392727503"
4,1,0,"0.36649684241348557,0.9511783938523962,0.8494919263589972"
1,1,2,"0.698592283371875,0.038622249776255946,0.5563827995742111"
0,4,1,"0.5227112950269823,0.3148264185956532,0.47562693094002173"
``nodes_0.csv`` and ``nodes_1.csv``:
.. code::
node_id,label,feat
0,2,"0.5400687466285844,0.7588441197954202,0.4268254673041745"
1,1,"0.08680051341900807,0.11446843700743892,0.7196969604886617"
2,2,"0.8964389655603473,0.23368113896545695,0.8813472954005022"
3,1,"0.5454703921677284,0.7819383771535038,0.3027939452162367"
4,1,"0.5365210052235699,0.8975240205792763,0.7613943085507672"
Once loaded, the dataset contains one heterogeneous graph with features and labels:
.. code:: python
>>> import dgl
>>> dataset = dgl.data.CSVDataset('./mini_hetero_dataset')
>>> g = dataset[0] # only one graph
>>> print(g)
Graph(num_nodes={'item': 5, 'user': 5},
num_edges={('user', 'follow', 'user'): 10, ('user', 'like', 'item'): 10},
metagraph=[('user', 'user', 'follow'), ('user', 'item', 'like')])
>>> g.nodes['user'].data
{'label': tensor([2, 1, 2, 1, 1]), 'feat': tensor([[0.5401, 0.7588, 0.4268],
[0.0868, 0.1145, 0.7197],
[0.8964, 0.2337, 0.8813],
[0.5455, 0.7819, 0.3028],
[0.5365, 0.8975, 0.7614]], dtype=torch.float64)}
>>> g.edges['like'].data
{'label': tensor([1, 2, 2, 0, 1, 2, 1, 0, 2, 1]), 'feat': tensor([[0.7368, 0.1052, 0.9419],
[0.5749, 0.2018, 0.4909],
[0.7697, 0.4940, 0.1086],
[0.1364, 0.1393, 0.7902],
[0.4299, 0.1839, 0.1843],
[0.8613, 0.6799, 0.6580],
[0.6595, 0.2650, 0.7891],
[0.3665, 0.9512, 0.8495],
[0.6986, 0.0386, 0.5564],
[0.5227, 0.3148, 0.4756]], dtype=torch.float64)}
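When a relation name alone is ambiguous (for example, when two edge types share the same
relation name), the canonical string triplet can be used as the key instead; a minimal sketch:

.. code:: python

   # Equivalent to g.edges['like'].data here, since the relation
   # name 'like' is unambiguous in this graph.
   feat = g.edges[('user', 'like', 'item')].data['feat']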
Dataset of multiple graphs
~~~~~~~~~~~~~~~~~~~~~~~~~~
When there are multiple graphs, one can include an additional CSV file for storing graph-level features.
Here is an example:
.. code::
./mini_multi_dataset/
|-- meta.yaml
|-- nodes.csv
|-- edges.csv
|-- graphs.csv
Accordingly, the ``meta.yaml`` should include an extra ``graph_data`` key specifying which CSV file
to load graph-level features from.
.. code:: yaml
dataset_name: mini_multi_dataset
edge_data:
- file_name: edges.csv
node_data:
- file_name: nodes.csv
graph_data:
  file_name: graphs.csv
To distinguish nodes and edges of different graphs, ``nodes.csv`` and ``edges.csv`` must contain
an extra column ``graph_id``:
``edges.csv``:
.. code::
graph_id,src_id,dst_id,feat
0,0,4,"0.39534097273254654,0.9422093637539785,0.634899790318452"
0,3,0,"0.04486384200747007,0.6453746567017163,0.8757520744192612"
0,3,2,"0.9397636966928355,0.6526403892728874,0.8643238446466464"
0,1,1,"0.40559906615287566,0.9848072295736628,0.493888090726854"
0,4,1,"0.253458867276219,0.9168191778828504,0.47224962583565544"
0,0,1,"0.3219496197945605,0.3439899477636117,0.7051530741717352"
0,2,1,"0.692873149428549,0.4770019763881086,0.21937428942781778"
0,4,0,"0.620118223673067,0.08691420300562658,0.86573472329756"
0,2,1,"0.00743445923710373,0.5251800239734318,0.054016385555202384"
0,4,1,"0.6776417760682221,0.7291568018841328,0.4523600060547709"
1,1,3,"0.6375445528248924,0.04878384701995819,0.4081642382536248"
1,0,4,"0.776002616178397,0.8851294998284638,0.7321742043493028"
1,1,0,"0.0928555079874982,0.6156748364694707,0.6985674921582508"
1,0,2,"0.31328748118329997,0.8326121496142408,0.04133991340612775"
1,1,0,"0.36786902637778773,0.39161865931662243,0.9971749359397111"
1,1,1,"0.4647410679872376,0.8478810655406659,0.6746269314422184"
1,0,2,"0.8117650553546695,0.7893727601272978,0.41527155506593394"
1,1,3,"0.40707309111756307,0.2796588354307046,0.34846782265758314"
1,1,0,"0.18626464175355095,0.3523777809254057,0.7863421810531344"
1,3,0,"0.28357022069634585,0.13774964202156292,0.5913335505943637"
``nodes.csv``:
.. code::
graph_id,node_id,feat
0,0,"0.5725330322207948,0.8451870383322376,0.44412796119211184"
0,1,"0.6624186423087752,0.6118386331195641,0.7352138669985214"
0,2,"0.7583372765843964,0.15218126307872892,0.6810484348765842"
0,3,"0.14627522432017592,0.7457985352827006,0.1037097085190507"
0,4,"0.49037522512771525,0.8778998699783784,0.0911194482288028"
1,0,"0.11158102039672668,0.08543289788089736,0.6901745368284345"
1,1,"0.28367647637469273,0.07502571020414439,0.01217200152200748"
1,2,"0.2472495901894738,0.24285506608575758,0.6494437360242048"
1,3,"0.5614197853127827,0.059172654879085296,0.4692371689047904"
1,4,"0.17583413999295983,0.5191278830882644,0.8453123358491914"
``graphs.csv`` contains a ``graph_id`` column and an arbitrary number of feature columns.
The example dataset here has two graphs, each with graph-level ``feat`` and ``label``
data.
.. code::
graph_id,feat,label
0,"0.7426272601929126,0.5197462471155317,0.8149104951283953",0
1,"0.534822233529295,0.2863627767733977,0.1154897249106891",0
Once loaded, the dataset contains multiple homogeneous graphs with features and labels:
.. code:: python
>>> import dgl
>>> dataset = dgl.data.CSVDataset('./mini_multi_dataset')
>>> print(len(dataset))
2
>>> graph0, data0 = dataset[0]
>>> print(graph0)
Graph(num_nodes=5, num_edges=10,
ndata_schemes={'feat': Scheme(shape=(3,), dtype=torch.float64)}
edata_schemes={'feat': Scheme(shape=(3,), dtype=torch.float64)})
>>> print(data0)
{'feat': tensor([0.7426, 0.5197, 0.8149], dtype=torch.float64), 'label': tensor(0)}
>>> graph1, data1 = dataset[1]
>>> print(graph1)
Graph(num_nodes=5, num_edges=10,
ndata_schemes={'feat': Scheme(shape=(3,), dtype=torch.float64)}
edata_schemes={'feat': Scheme(shape=(3,), dtype=torch.float64)})
>>> print(data1)
{'feat': tensor([0.5348, 0.2864, 0.1155], dtype=torch.float64), 'label': tensor(0)}
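A multi-graph dataset like this can be fed directly to a batching data loader for graph-level
tasks. Below is a minimal sketch, assuming :class:`~dgl.dataloading.GraphDataLoader` with its
default collate function (which batches the graphs and collates the graph-level data dicts):

.. code:: python

   from dgl.dataloading import GraphDataLoader

   dataloader = GraphDataLoader(dataset, batch_size=2, shuffle=True)
   for batched_graph, data in dataloader:
       # ``batched_graph`` merges all graphs in the batch into one DGLGraph;
       # ``data`` is a dict of stacked graph-level tensors, e.g. ``data['label']``.
       print(batched_graph.batch_size, data['label'])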
Custom Data Parser
~~~~~~~~~~~~~~~~~~
By default, ``CSVDataset`` assumes that all the stored node-/edge-/graph-level data are numerical
values. Users can provide a custom ``DataParser`` to ``CSVDataset`` to handle more complex
data types. A ``DataParser`` needs to implement the ``__call__`` method, which takes the
:class:`pandas.DataFrame` object created from a CSV file and returns a dictionary of
parsed feature data. The parsed feature data will be saved to the ``ndata`` and ``edata`` of
the corresponding ``DGLGraph`` object, and thus must be tensors or NumPy arrays. Below is an
example ``DataParser`` that converts string labels to integers.
Given a dataset as follows,
.. code::
./customized_parser_dataset/
|-- meta.yaml
|-- nodes.csv
|-- edges.csv
``meta.yaml``:
.. code:: yaml
dataset_name: customized_parser_dataset
edge_data:
- file_name: edges.csv
node_data:
- file_name: nodes.csv
``edges.csv``:
.. code::
src_id,dst_id,label
4,0,positive
4,0,negative
0,3,positive
0,1,positive
0,2,negative
0,0,positive
2,2,negative
1,0,positive
3,0,negative
4,0,positive
``nodes.csv``:
.. code::
node_id,label
0,positive
1,negative
2,positive
3,negative
4,positive
To parse the string labels, one can define a ``DataParser`` class as follows:
.. code:: python
import numpy as np
import pandas as pd

class MyDataParser:
    def __call__(self, df: pd.DataFrame):
        parsed = {}
        for header in df:
            # pandas names columns without a header 'Unnamed: N'; skip them.
            if 'Unnamed' in header:
                print("Unnamed column found. Ignored...")
                continue
            dt = df[header].to_numpy().squeeze()
            if header == 'label':
                # Map string labels to integers: 'positive' -> 1, others -> 0.
                dt = np.array([1 if e == 'positive' else 0 for e in dt])
            parsed[header] = dt
        return parsed
Create a ``CSVDataset`` using the defined ``DataParser``:
.. code:: python
>>> import dgl
>>> dataset = dgl.data.CSVDataset('./customized_parser_dataset',
... ndata_parser=MyDataParser(),
... edata_parser=MyDataParser())
>>> print(dataset[0].ndata['label'])
tensor([1, 0, 1, 0, 1])
>>> print(dataset[0].edata['label'])
tensor([1, 0, 1, 1, 0, 1, 0, 1, 0, 1])
.. note::
To specify different ``DataParser``\s for different node/edge types, pass a dictionary to
``ndata_parser`` and ``edata_parser``, where the key is the type name (a single string for
a node type; a string triplet for an edge type) and the value is the ``DataParser`` to use.
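For instance, a hypothetical sketch for the heterogeneous dataset shown earlier, assuming one
parser per type (``MyDataParser`` is the class defined above; the parsers here are for
illustration of the API shape only):

.. code:: python

   dataset = dgl.data.CSVDataset(
       './mini_hetero_dataset',
       ndata_parser={'user': MyDataParser(), 'item': MyDataParser()},
       edata_parser={('user', 'follow', 'user'): MyDataParser(),
                     ('user', 'like', 'item'): MyDataParser()})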
Full YAML Specification
~~~~~~~~~~~~~~~~~~~~~~~
``CSVDataset`` allows more flexible control over the loading and parsing process. For example, one
can change the ID column names via ``meta.yaml``. The example below lists all the supported keys.
.. code:: yaml
version: 1.0.0
dataset_name: some_complex_data
separator: ',' # CSV separator symbol. Default: ','
edge_data:
- file_name: edges_0.csv
etype: [user, follow, user]
src_id_field: src_id # Column name for source node IDs. Default: src_id
dst_id_field: dst_id # Column name for destination node IDs. Default: dst_id
- file_name: edges_1.csv
etype: [user, like, item]
src_id_field: src_id
dst_id_field: dst_id
node_data:
- file_name: nodes_0.csv
ntype: user
node_id_field: node_id # Column name for node IDs. Default: node_id
- file_name: nodes_1.csv
ntype: item
node_id_field: node_id # Column name for node IDs. Default: node_id
graph_data:
file_name: graphs.csv
graph_id_field: graph_id # Column name for graph IDs. Default: graph_id
Top-level
^^^^^^^^^^^^^^
At the top level, only 6 keys are available:
- ``version``: Optional. String.
It specifies which version of ``meta.yaml`` is used. More features may be added in the future.
- ``dataset_name``: Required. String.
It specifies the dataset name.
- ``separator``: Optional. String.
It specifies how to parse data in CSV files. Default: ``','``.
- ``edge_data``: Required. List of ``EdgeData``.
Metadata for parsing edge CSV files.
- ``node_data``: Required. List of ``NodeData``.
Metadata for parsing node CSV files.
- ``graph_data``: Optional. ``GraphData``.
Metadata for parsing the graph CSV file.
``EdgeData``
^^^^^^^^^^^^^^^^^^^^^^
There are 4 keys:
- ``file_name``: Required. String.
The CSV file to load data from.
- ``etype``: Optional. List of strings.
Edge type name as a string triplet: [source node type, relation type, destination node type].
- ``src_id_field``: Optional. String.
Which column to read for source node IDs. Default: ``src_id``.
- ``dst_id_field``: Optional. String.
Which column to read for destination node IDs. Default: ``dst_id``.
``NodeData``
^^^^^^^^^^^^^^^^^^^^^^
There are 3 keys:
- ``file_name``: Required. String.
The CSV file to load data from.
- ``ntype``: Optional. String.
Node type name.
- ``node_id_field``: Optional. String.
Which column to read for node IDs. Default: ``node_id``.
``GraphData``
^^^^^^^^^^^^^^^^^^^^^^
There are 2 keys:
- ``file_name``: Required. String.
The CSV file to load data from.
- ``graph_id_field``: Optional. String.
Which column to read for graph IDs. Default: ``graph_id``.
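Since ``CSVDataset`` is a standard :class:`~dgl.data.DGLDataset`, the parsed graphs are cached
after the first load. A minimal sketch, assuming the ``force_reload`` flag inherited from the
base class triggers a fresh parse of the CSV files:

.. code:: python

   import dgl

   # Re-parse the CSV files even if a processed cache exists (assumption:
   # ``force_reload`` behaves as in the DGLDataset base class).
   ds = dgl.data.CSVDataset('/path/to/dataset', force_reload=True)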
\ No newline at end of file
...@@ -23,6 +23,7 @@ shows how to implement each component of it.
* :ref:`guide-data-pipeline-process`
* :ref:`guide-data-pipeline-savenload`
* :ref:`guide-data-pipeline-loadogb`
* :ref:`guide-data-pipeline-loadogb` * :ref:`guide-data-pipeline-loadogb`
* :ref:`guide-data-pipeline-loadcsv`
.. toctree::
    :maxdepth: 1
...@@ -34,3 +35,4 @@ shows how to implement each component of it.
    data-process
    data-savenload
    data-loadogb
    data-loadcsv
\ No newline at end of file
...@@ -6,8 +6,7 @@ from ..base import DGLError
class CSVDataset(DGLDataset):
-    """ This class aims to parse data from CSV files, construct DGLGraph
-    and behaves as a DGLDataset.
+    """Dataset class that loads and parses graph data from CSV files.
    Parameters
    ----------
...@@ -51,7 +50,9 @@ class CSVDataset(DGLDataset):
    any available graph-level data such as graph-level feature, labels.

    Examples
-    [TODO]: link to a detailed web page.
+    --------
+    Please refer to :ref:`guide-data-pipeline-loadcsv`.
    """
    META_YAML_NAME = 'meta.yaml'
......
...@@ -222,6 +222,16 @@ dataset = SyntheticDataset()
graph, label = dataset[0]
print(graph, label)
######################################################################
# Creating a Dataset from CSV via :class:`~dgl.data.CSVDataset`
# ------------------------------------------------------------
#
# The previous examples describe how to create a dataset from CSV files
# step-by-step. DGL also provides a utility class :class:`~dgl.data.CSVDataset`
# for reading and parsing data from CSV files. See :ref:`guide-data-pipeline-loadcsv`
# for more details.
#
# Thumbnail credits: (Un)common Use Cases for Graph Databases, Michal Bachman
# sphinx_gallery_thumbnail_path = '_static/blitz_6_load_data.png'