.. _guide-data-pipeline-loadcsv:

4.6 Loading data from CSV files
----------------------------------------------
Comma Separated Value (CSV) is a widely used data storage format. DGL provides
:class:`~dgl.data.CSVDataset` for loading and parsing graph data stored in
CSV format.

To create a ``CSVDataset`` object:
.. code:: python

    import dgl
    ds = dgl.data.CSVDataset('/path/to/dataset')
The returned ``ds`` object is a standard :class:`~dgl.data.DGLDataset`. For
example, one can get graph samples using ``__getitem__`` as well as node/edge
features using ``ndata``/``edata``.

.. code:: python
    # A demonstration of how to use the loaded dataset. The feature names
    # may vary depending on the CSV contents.
    g = ds[0]  # get the graph
    label = g.ndata['label']
    feat = g.ndata['feat']
    ...

Data folder structure
~~~~~~~~~~~~~~~~~~~~~

...
    |-- graphs.csv          # graph-level features

Node/edge/graph-level data are stored in CSV files. ``meta.yaml`` is a metadata file specifying
where to read nodes/edges/graphs data and how to parse them to construct the dataset
object. A minimal data folder contains one ``meta.yaml`` and two CSVs, one for node data and one
for edge data, in which case the dataset contains only a single graph with no graph-level data.
Dataset of a single feature-less graph
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

When the dataset contains only one graph with no node or edge features, only three files
are needed in the data folder: ``meta.yaml``, one CSV for node IDs and one CSV for edges:

...

Once loaded, the dataset has one graph without any features:

.. code:: python

    >>> import dgl
    >>> dataset = dgl.data.CSVDataset('./mini_featureless_dataset')
    >>> g = dataset[0]  # only one graph
    >>> print(g)
    Graph(num_nodes=5, num_edges=10,
          ndata_schemes={}
          edata_schemes={})
.. note::

    Non-integer node IDs are allowed. When constructing the graph, ``CSVDataset`` will
    map each raw ID to an integer ID starting from zero.
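For instance, a hypothetical ``nodes.csv`` could use string IDs, which are then remapped
to integer IDs starting from zero:

.. code::

    node_id
    alice
    bob
    carol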
.. note::

    Edges are always directed. To have both directions, add reversed edges in the edge
    CSV file or use :class:`~dgl.transform.AddReverse` to transform the loaded graph.
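For example, a hypothetical two-node ``edges.csv`` that stores both edge directions
explicitly:

.. code::

    src_id,dst_id
    0,1
    1,0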
A graph without any feature is often of less interest. In the next example, we will show
how to load and parse node or edge features.

Dataset of a single graph with features and labels
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

When the dataset contains a single graph with node or edge features and labels, still only
three files are needed in the data folder: ``meta.yaml``, one CSV for node IDs and one CSV
for edges:

...

    node_data:
      - file_name: nodes.csv
``edges.csv`` with five synthetic edge data columns (``label``, ``train_mask``, ``val_mask``,
``test_mask``, ``feat``):

.. code::

    src_id,dst_id,label,train_mask,val_mask,test_mask,feat
    4,0,2,False,True,True,"0.5477868606453535, 0.4470617033458436, 0.936706701616337"
    4,0,0,False,False,True,"0.9794634290792008, 0.23682038840665198, 0.049629338970987646"
    0,3,1,True,True,True,"0.8586722047523594, 0.5746912787380253, 0.6462162561249654"
    0,1,2,True,False,False,"0.2730008213674695, 0.5937484188166621, 0.765544096939567"
    0,2,1,True,True,True,"0.45441619816038514, 0.1681403185591509, 0.9952376085297715"
    0,0,0,False,False,False,"0.4197669213305396, 0.849983324532477, 0.16974127573016262"
    2,2,1,False,True,True,"0.5495035052928215, 0.21394654203489705, 0.7174910641836348"
    1,0,2,False,True,False,"0.008790817766266334, 0.4216530595907526, 0.529195480661293"
    3,0,0,True,True,True,"0.6598715708878852, 0.1932390907048961, 0.9774471538377553"
    4,0,1,False,False,False,"0.16846068931179736, 0.41516080644186737, 0.002158116134429955"
``nodes.csv`` with five synthetic node data columns (``label``, ``train_mask``, ``val_mask``,
``test_mask``, ``feat``):

.. code::

    node_id,label,train_mask,val_mask,test_mask,feat
    0,1,False,True,True,"0.07816474278491703, 0.9137336384979067, 0.4654086994009452"
    1,1,True,True,True,"0.05354099924658973, 0.8753101998792645, 0.33929432608774135"
    2,1,True,False,True,"0.33234211884156384, 0.9370522452510665, 0.6694943496824788"
    3,0,False,True,False,"0.9784264442230887, 0.22131880861864428, 0.3161154827254189"
    4,1,True,True,False,"0.23142237259162102, 0.8715767748481147, 0.19117861103555467"
Once loaded, the dataset has one graph. Node/edge features are stored in ``ndata`` and
``edata`` with the same column names. The example also demonstrates how to specify a
vector-shaped feature: a comma-separated list enclosed in double quotes ``"..."``.

.. code:: python

    >>> import dgl
    >>> dataset = dgl.data.CSVDataset('./mini_feature_dataset')
    >>> g = dataset[0]  # only one graph
    >>> print(g)
    Graph(num_nodes=5, num_edges=10,
          ndata_schemes={'label': Scheme(shape=(), dtype=torch.int64), 'train_mask': Scheme(shape=(), dtype=torch.bool), 'val_mask': Scheme(shape=(), dtype=torch.bool), 'test_mask': Scheme(shape=(), dtype=torch.bool), 'feat': Scheme(shape=(3,), dtype=torch.float64)}
          edata_schemes={'label': Scheme(shape=(), dtype=torch.int64), 'train_mask': Scheme(shape=(), dtype=torch.bool), 'val_mask': Scheme(shape=(), dtype=torch.bool), 'test_mask': Scheme(shape=(), dtype=torch.bool), 'feat': Scheme(shape=(3,), dtype=torch.float64)})
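The vector-shaped ``feat`` column is thus parsed into a 2-D tensor. A quick check (a sketch;
the expected shape follows from the schemes printed above):

.. code:: python

    >>> g.ndata['feat'].shape   # 5 nodes, each with a 3-dim float feature
    torch.Size([5, 3])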
.. note::

    By default, ``CSVDataset`` assumes all feature data to be numerical values (e.g., int,
    float, bool or list) and missing values are not allowed. Users can provide a custom
    data parser for these cases. See `Custom Data Parser`_ for more details.
Dataset of a single heterogeneous graph
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

One can specify multiple node and edge CSV files (each for one type) to represent a
heterogeneous graph. Here is an example dataset with two node types and two edge types:

.. code::

    ...
    |-- edges_0.csv
    |-- edges_1.csv
The ``meta.yaml`` specifies the node type name (using ``ntype``) and edge type name (using
``etype``) of each CSV file. The edge type name is a string triplet containing the source
node type name, relation name and the destination node type name.

.. code:: yaml

    dataset_name: mini_hetero_dataset
    edge_data:
      - file_name: edges_0.csv
        etype: [user, follow, user]
      - file_name: edges_1.csv
        etype: [user, like, item]
    node_data:
      - file_name: nodes_0.csv
        ntype: user
      - file_name: nodes_1.csv
        ntype: item
The node and edge CSV files follow the same format as in homogeneous graphs. Here are some
synthetic data for demonstration purposes.

``edges_0.csv`` and ``edges_1.csv``:

.. code::

    ...
    1,1,2,"0.698592283371875,0.038622249776255946,0.5563827995742111"
    0,4,1,"0.5227112950269823,0.3148264185956532,0.47562693094002173"
``nodes_0.csv`` and ``nodes_1.csv``:

.. code::

    ...

Once loaded, the dataset has one heterograph with features and labels:

.. code:: python

    >>> import dgl
    >>> dataset = dgl.data.CSVDataset('./mini_hetero_dataset')
    >>> g = dataset[0]  # only one graph
    >>> print(g)
    Graph(num_nodes={'item': 5, 'user': 5},
          num_edges={('user', 'follow', 'user'): 10, ('user', 'like', 'item'): 10},
          metagraph=[('user', 'user', 'follow'), ('user', 'item', 'like')])
    >>> g.nodes['user'].data
    {'label': tensor([2, 1, 2, 1, 1]), 'feat': tensor([[0.5401, 0.7588, 0.4268],
            [0.0868, 0.1145, 0.7197],
            [0.8964, 0.2337, 0.8813],
            [0.5455, 0.7819, 0.3028],
            [0.5365, 0.8975, 0.7614]], dtype=torch.float64)}
    >>> g.edges['like'].data
    {'label': tensor([1, 2, 2, 0, 1, 2, 1, 0, 2, 1]), 'feat': tensor([[0.7368, 0.1052, 0.9419],
            [0.5749, 0.2018, 0.4909],
            [0.7697, 0.4940, 0.1086],
            [0.1364, 0.1393, 0.7902],
            [0.4299, 0.1839, 0.1843],
            [0.8613, 0.6799, 0.6580],
            [0.6595, 0.2650, 0.7891],
            [0.3665, 0.9512, 0.8495],
            [0.6986, 0.0386, 0.5564],
            [0.5227, 0.3148, 0.4756]], dtype=torch.float64)}
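The loaded heterograph supports the usual typed query API as well; for instance (a sketch
consistent with the output above):

.. code:: python

    >>> g.num_nodes('user')   # number of nodes of type 'user'
    5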
Dataset of multiple graphs
~~~~~~~~~~~~~~~~~~~~~~~~~~

When there are multiple graphs, one can include an additional CSV file for storing
graph-level features. Here is an example:

.. code::

    ...
    |-- edges.csv
    |-- graphs.csv
Accordingly, the ``meta.yaml`` should include an extra ``graph_data`` key to tell which CSV
file to load graph-level features from.

.. code:: yaml

    ...
    node_data:
      - file_name: nodes.csv
    graph_data:
      file_name: graphs.csv
To distinguish nodes and edges of different graphs, the ``nodes.csv`` and ``edges.csv`` must
contain an extra column ``graph_id``.

``edges.csv``:

.. code::

    ...
    1,3,"0.5614197853127827,0.059172654879085296,0.4692371689047904"
    1,4,"0.17583413999295983,0.5191278830882644,0.8453123358491914"
The ``graphs.csv`` contains a ``graph_id`` column and an arbitrary number of feature columns.
The example dataset here has two graphs, each with graph-level ``feat`` and ``label`` data.

.. code::

    ...
    0,"0.7426272601929126,0.5197462471155317,0.8149104951283953",0
    1,"0.534822233529295,0.2863627767733977,0.1154897249106891",0
Once loaded, the dataset has multiple homographs with features and labels:

.. code:: python

    >>> import dgl
    >>> dataset = dgl.data.CSVDataset('./mini_multi_dataset')
    >>> print(len(dataset))
    2
    >>> graph0, data0 = dataset[0]
    >>> print(graph0)
    Graph(num_nodes=5, num_edges=10,
          ndata_schemes={'feat': Scheme(shape=(3,), dtype=torch.float64)}
          edata_schemes={'feat': Scheme(shape=(3,), dtype=torch.float64)})
    >>> print(data0)
    {'feat': tensor([0.7426, 0.5197, 0.8149]), 'label': tensor([0])}
    >>> graph1, data1 = dataset[1]
    >>> print(graph1)
    Graph(num_nodes=5, num_edges=10,
          ndata_schemes={'feat': Scheme(shape=(3,), dtype=torch.float64)}
          edata_schemes={'feat': Scheme(shape=(3,), dtype=torch.float64)})
    >>> print(data1)
    {'feat': tensor([0.5348, 0.2864, 0.1155]), 'label': tensor([0])}
.. note::

    When there are multiple graphs, ``CSVDataset`` currently requires them to be homogeneous.
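Since each sample is a ``(graph, data_dict)`` pair, such a dataset can be fed into a
graph-level training loop. Below is a minimal sketch, assuming DGL's
:class:`~dgl.dataloading.GraphDataLoader` with its default collate behavior; the loader
and ``batch_size`` value are illustrative, not part of the example dataset:

.. code:: python

    from dgl.dataloading import GraphDataLoader

    # Each minibatch yields a batched DGLGraph plus the collated per-graph
    # data dictionary parsed from graphs.csv.
    dataloader = GraphDataLoader(dataset, batch_size=2, shuffle=True)
    for batched_graph, data in dataloader:
        feat, label = data['feat'], data['label']
        ...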
Custom Data Parser
~~~~~~~~~~~~~~~~~~

By default, ``CSVDataset`` assumes that all the stored node-/edge-/graph-level data are
numerical values. Users can provide a custom ``DataParser`` to ``CSVDataset`` to handle more
complex data types. A ``DataParser`` needs to implement the ``__call__`` method which takes
in the :class:`pandas.DataFrame` object created from the CSV file and should return a
dictionary of parsed feature data. The parsed feature data will be saved to the ``ndata``
and ``edata`` of the corresponding ``DGLGraph`` object, and thus must be tensors or numpy
arrays. Below is an example ``DataParser`` which converts string type labels to integers:
Given a dataset as follows,

.. code::

    ...
    3,negative
    4,positive
To parse the string type labels, one can define a ``DataParser`` class as follows:

.. code:: python

    import numpy as np
    import pandas as pd

    class MyDataParser:
        """Convert labels which are in string format into numeric values."""
        def __call__(self, df: pd.DataFrame):
            parsed = {}
            for header in df:
                if 'Unnamed' in header:  # skip pandas' auto-generated 'Unnamed' columns
                    print("Unnamed column is found. Ignored...")
                    continue
                dt = df[header].to_numpy().squeeze()
                if header == 'label':
                    dt = np.array([1 if e == 'positive' else 0 for e in dt])
                parsed[header] = dt
            return parsed
Create a ``CSVDataset`` using the defined ``DataParser``:

.. code:: python

    >>> dataset = dgl.data.CSVDataset('./customized_parser_dataset',
    ...                               ndata_parser=MyDataParser(),
    ...                               edata_parser=MyDataParser())
    >>> print(dataset[0].ndata['label'])
    tensor([1, 0, 1, 0, 1])
    >>> print(dataset[0].edata['label'])
    tensor([1, 0, 1, 1, 0, 1, 0, 1, 0, 1])
.. note::

    To specify different ``DataParser``\ s for different node/edge types, pass a dictionary
    to ``ndata_parser`` and ``edata_parser``, where the key is the type name (a single
    string for a node type; a string triplet for an edge type) and the value is the
    ``DataParser`` to use.
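For instance, a hypothetical heterogeneous variant (type names reused from the earlier
``mini_hetero_dataset`` example; the per-type assignment itself is illustrative):

.. code:: python

    >>> # One DataParser per node/edge type, keyed by the ntype/etype in meta.yaml.
    >>> dataset = dgl.data.CSVDataset(
    ...     './mini_hetero_dataset',
    ...     ndata_parser={'user': MyDataParser()},
    ...     edata_parser={('user', 'like', 'item'): MyDataParser()})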
Full YAML Specification
~~~~~~~~~~~~~~~~~~~~~~~

``CSVDataset`` allows more flexible control over the loading and parsing process. For
example, one can change the ID column names via ``meta.yaml``. The example below lists all
the supported keys.

.. code:: yaml

    version: 1.0.0
    dataset_name: some_complex_data
    separator: ','                    # CSV separator symbol. Default: ','
    edge_data:
      - file_name: edges_0.csv
        etype: [user, follow, user]
        src_id_field: src_id          # Column name for source node IDs. Default: src_id
        dst_id_field: dst_id          # Column name for destination node IDs. Default: dst_id
      - file_name: edges_1.csv
        etype: [user, like, item]
        src_id_field: src_id
        dst_id_field: dst_id
    node_data:
      - file_name: nodes_0.csv
        ntype: user
        node_id_field: node_id        # Column name for node IDs. Default: node_id
      - file_name: nodes_1.csv
        ntype: item
        node_id_field: node_id        # Column name for node IDs. Default: node_id
    graph_data:
      file_name: graphs.csv
      graph_id_field: graph_id        # Column name for graph IDs. Default: graph_id
Top-level
^^^^^^^^^

At the top level, only 6 keys are available:

- ``version``: Optional. String.
  It specifies which version of ``meta.yaml`` is used. More features may be added in the future.
- ``dataset_name``: Required. String.
  It specifies the dataset name.
- ``separator``: Optional. String.
  It specifies how to parse data in CSV files. Default: ``','``.
- ``edge_data``: Required. List of ``EdgeData``.
  Meta data for parsing edge CSV files.
- ``node_data``: Required. List of ``NodeData``.
  Meta data for parsing node CSV files.
- ``graph_data``: Optional. ``GraphData``.
  Meta data for parsing the graph CSV file.

``EdgeData``
^^^^^^^^^^^^

There are 4 keys:

- ``file_name``: Required. String.
  The CSV file to load data from.
- ``etype``: Optional. List of string.
  Edge type name as a string triplet: [source node type, relation type, destination node type].
- ``src_id_field``: Optional. String.
  Which column to read for source node IDs. Default: ``src_id``.
- ``dst_id_field``: Optional. String.
  Which column to read for destination node IDs. Default: ``dst_id``.

``NodeData``
^^^^^^^^^^^^

There are 3 keys:

- ``file_name``: Required. String.
  The CSV file to load data from.
- ``ntype``: Optional. String.
  Node type name.
- ``node_id_field``: Optional. String.
  Which column to read for node IDs. Default: ``node_id``.

``GraphData``
^^^^^^^^^^^^^

There are 2 keys:

- ``file_name``: Required. String.
  The CSV file to load data from.
- ``graph_id_field``: Optional. String.
  Which column to read for graph IDs. Default: ``graph_id``.
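Tying the spec together, a minimal sketch of a ``meta.yaml`` that relies entirely on the
defaults above (a single homogeneous graph; the dataset name is hypothetical):

.. code:: yaml

    dataset_name: my_minimal_dataset
    edge_data:
      - file_name: edges.csv    # src_id/dst_id columns are read by default
    node_data:
      - file_name: nodes.csv    # the node_id column is read by default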