.. _guide-data-pipeline-loadcsv:

4.6 Loading datasets from CSV files
-----------------------------------

Problem & Motivation
~~~~~~~~~~~~~~~~~~~~

With the growing interest in graph deep learning, many ML researchers and
data scientists wish to try GNN models on custom datasets. Although DGL has
a recommended practice on how a dataset object should behave once loaded
into RAM (see :ref:`guide-data-pipeline-dataset`), the on-disk storage
format is still largely arbitrary. This section defines an on-disk graph
storage format based on Comma Separated Values (CSV) and adds a new dataset
class called :class:`~dgl.data.DGLCSVDataset` for loading and processing it
in accordance with the current data pipeline practice. We choose the CSV
format for its wide acceptance, good readability and the rich set of
toolkits for loading, creating and manipulating it (e.g., ``pandas``).

Use :class:`~dgl.data.DGLCSVDataset` in DGL
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

To create a ``DGLCSVDataset`` object:

.. code:: python

    import dgl
    ds = dgl.data.DGLCSVDataset('/path/to/dataset')

The returned ``ds`` object is a standard :class:`~dgl.data.DGLDataset`. For
example, if the dataset is for single-graph node classification, you can use
it as follows:

.. code:: python

    g = ds[0]  # get the graph
    label = g.ndata['label']
    feat = g.ndata['feat']

Data folder structure
~~~~~~~~~~~~~~~~~~~~~

.. code::

    /path/to/dataset/
    |-- meta.yaml     # metadata of the dataset
    |-- edges_0.csv   # edge data including src_id, dst_id, feature, label and so on
    |-- ...           # you can have as many CSVs for edge data as you want
    |-- nodes_0.csv   # node data including node_id, feature, label and so on
    |-- ...           # you can have as many CSVs for node data as you want
    |-- graphs.csv    # graph-level features

Node/edge/graph-level data are stored in CSV files. ``meta.yaml`` is a
metadata file specifying where to read the nodes/edges/graphs data and how
to parse them in order to construct the dataset object. A minimal data
folder contains one ``meta.yaml`` and two CSVs, one for node data and one
for edge data, in which case the dataset contains a single graph with no
graph-level data.

Examples
~~~~~~~~

Dataset of a single feature-less graph
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

When the dataset contains only one graph with no node or edge features, the
data folder needs only three files: ``meta.yaml``, one CSV for node IDs and
one CSV for edges:

.. code::

    ./mini_featureless_dataset/
    |-- meta.yaml
    |-- nodes.csv
    |-- edges.csv

``meta.yaml`` contains the following information:

.. code:: yaml

    dataset_name: mini_featureless_dataset
    edge_data:
    - file_name: edges.csv
    node_data:
    - file_name: nodes.csv

``nodes.csv`` lists the node IDs under the ``node_id`` field:

.. code::

    node_id
    0
    1
    2
    3
    4

``edges.csv`` lists all the edges in two columns (``src_id`` and ``dst_id``)
specifying the source and destination node ID of each edge:

.. code::

    src_id,dst_id
    4,4
    4,1
    3,0
    4,1
    4,0
    1,2
    1,3
    3,3
    1,1
    4,1

Once loaded, the dataset has one graph without any features:

.. code:: python

    import dgl
    dataset = dgl.data.DGLCSVDataset('./mini_featureless_dataset')
    g = dataset[0]  # only one graph
    print(g)
    # Graph(num_nodes=5, num_edges=10,
    #       ndata_schemes={}
    #       edata_schemes={})

A graph without any features is often of less interest. The next example
shows how node and edge features are stored.

.. note::
    The graph generated here is always directed. If you need reverse edges,
    please add them manually, for example as sketched below.
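One way to do so is with :func:`dgl.add_reverse_edges`, which appends a
reversed copy of every edge. This is a minimal sketch assuming the
feature-less homogeneous graph loaded above; whether duplicated reverse
edges are appropriate depends on your model:

.. code:: python

    import dgl

    dataset = dgl.data.DGLCSVDataset('./mini_featureless_dataset')
    g = dataset[0]
    # Append a reversed copy of every edge so messages can flow both ways.
    bg = dgl.add_reverse_edges(g)
    print(bg.num_edges())  # 20, i.e., twice the original edge count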
Dataset of a single graph with features and labels
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

When the dataset contains only one graph with node or edge features and
labels, the data folder still needs only three files: ``meta.yaml``, one CSV
for node data and one CSV for edge data:

.. code::

    ./mini_feature_dataset/
    |-- meta.yaml
    |-- nodes.csv
    |-- edges.csv

``meta.yaml``:

.. code:: yaml

    dataset_name: mini_feature_dataset
    edge_data:
    - file_name: edges.csv
    node_data:
    - file_name: nodes.csv

``edges.csv``:

.. code::

    src_id,dst_id,label,train_mask,val_mask,test_mask,feat
    4,0,2,False,True,True,"[0.5477868606453535, 0.4470617033458436, 0.936706701616337]"
    4,0,0,False,False,True,"[0.9794634290792008, 0.23682038840665198, 0.049629338970987646]"
    0,3,1,True,True,True,"[0.8586722047523594, 0.5746912787380253, 0.6462162561249654]"
    0,1,2,True,False,False,"[0.2730008213674695, 0.5937484188166621, 0.765544096939567]"
    0,2,1,True,True,True,"[0.45441619816038514, 0.1681403185591509, 0.9952376085297715]"
    0,0,0,False,False,False,"[0.4197669213305396, 0.849983324532477, 0.16974127573016262]"
    2,2,1,False,True,True,"[0.5495035052928215, 0.21394654203489705, 0.7174910641836348]"
    1,0,2,False,True,False,"[0.008790817766266334, 0.4216530595907526, 0.529195480661293]"
    3,0,0,True,True,True,"[0.6598715708878852, 0.1932390907048961, 0.9774471538377553]"
    4,0,1,False,False,False,"[0.16846068931179736, 0.41516080644186737, 0.002158116134429955]"

``nodes.csv``:

.. code::

    node_id,label,train_mask,val_mask,test_mask,feat
    0,1,False,True,True,"[0.07816474278491703, 0.9137336384979067, 0.4654086994009452]"
    1,1,True,True,True,"[0.05354099924658973, 0.8753101998792645, 0.33929432608774135]"
    2,1,True,False,True,"[0.33234211884156384, 0.9370522452510665, 0.6694943496824788]"
    3,0,False,True,False,"[0.9784264442230887, 0.22131880861864428, 0.3161154827254189]"
    4,1,True,True,False,"[0.23142237259162102, 0.8715767748481147, 0.19117861103555467]"

Once loaded, the dataset has one graph with features and labels:

.. code:: python

    import dgl
    dataset = dgl.data.DGLCSVDataset('./mini_feature_dataset')
    g = dataset[0]  # only one graph
    print(g)
    # Graph(num_nodes=5, num_edges=10,
    #       ndata_schemes={'label': Scheme(shape=(), dtype=torch.int64), 'train_mask': Scheme(shape=(), dtype=torch.bool), 'val_mask': Scheme(shape=(), dtype=torch.bool), 'test_mask': Scheme(shape=(), dtype=torch.bool), 'feat': Scheme(shape=(3,), dtype=torch.float64)}
    #       edata_schemes={'label': Scheme(shape=(), dtype=torch.int64), 'train_mask': Scheme(shape=(), dtype=torch.bool), 'val_mask': Scheme(shape=(), dtype=torch.bool), 'test_mask': Scheme(shape=(), dtype=torch.bool), 'feat': Scheme(shape=(3,), dtype=torch.float64)})

.. note::
    All columns are read, parsed and set as node/edge attributes, except for
    ``node_id`` in ``nodes.csv`` and ``src_id``/``dst_id`` in ``edges.csv``.
    You can access them directly, e.g., ``g.ndata['label']``. The keys in
    ``g.ndata`` and ``g.edata`` are the same as the original column names.
    Data formats are inferred automatically during parsing.
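The Boolean mask columns translate directly into training splits. Below is a
minimal sketch (assuming the ``mini_feature_dataset`` above and a PyTorch
backend) that selects the training nodes via ``train_mask``:

.. code:: python

    import dgl
    import torch

    dataset = dgl.data.DGLCSVDataset('./mini_feature_dataset')
    g = dataset[0]
    # IDs of the nodes whose train_mask is True.
    train_nids = torch.nonzero(g.ndata['train_mask'], as_tuple=True)[0]
    train_feat = g.ndata['feat'][train_nids]
    train_label = g.ndata['label'][train_nids]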
Dataset of a single heterogeneous graph
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

When the dataset contains only one heterogeneous graph with 2 node types and
2 edge types, the data folder needs only 5 files: ``meta.yaml``, 2 CSVs for
nodes and 2 CSVs for edges:

.. code::

    ./mini_hetero_dataset/
    |-- meta.yaml
    |-- nodes_0.csv
    |-- nodes_1.csv
    |-- edges_0.csv
    |-- edges_1.csv

``meta.yaml``: for a heterogeneous graph, ``etype`` and ``ntype`` are
required and must be unique in ``edge_data`` and ``node_data`` respectively;
otherwise, only the last etype/ntype is kept when generating the graph, as
all of them would share the same default etype/ntype name. In addition, each
node/edge CSV file should contain a single, unique ntype/etype. If there are
several ntypes/etypes, multiple node/edge CSV files are required.

.. code:: yaml

    dataset_name: mini_hetero_dataset
    edge_data:
    - file_name: edges_0.csv
      etype:
      - user
      - follow
      - user
    - file_name: edges_1.csv
      etype:
      - user
      - like
      - item
    node_data:
    - file_name: nodes_0.csv
      ntype: user
    - file_name: nodes_1.csv
      ntype: item

``edges_0.csv``, ``edges_1.csv`` (both files have the same content; this is
for demonstration only):

.. code::

    src_id,dst_id,label,feat
    4,4,1,"0.736833152378035,0.10522806046048205,0.9418796835016118"
    3,4,2,"0.5749339182767451,0.20181320245665535,0.490938012147181"
    1,4,2,"0.7697294432580938,0.49397782380750765,0.10864079337442234"
    0,4,0,"0.1364240150959487,0.1393107840629273,0.7901988878812207"
    2,3,1,"0.42988138237505735,0.18389137408509248,0.18431292077750894"
    0,4,2,"0.8613368738351794,0.67985810014162,0.6580438064356824"
    2,4,1,"0.6594951663841697,0.26499036865016423,0.7891429392727503"
    4,1,0,"0.36649684241348557,0.9511783938523962,0.8494919263589972"
    1,1,2,"0.698592283371875,0.038622249776255946,0.5563827995742111"
    0,4,1,"0.5227112950269823,0.3148264185956532,0.47562693094002173"

``nodes_0.csv``, ``nodes_1.csv`` (both files have the same content; this is
for demonstration only):

.. code::

    node_id,label,feat
    0,2,"0.5400687466285844,0.7588441197954202,0.4268254673041745"
    1,1,"0.08680051341900807,0.11446843700743892,0.7196969604886617"
    2,2,"0.8964389655603473,0.23368113896545695,0.8813472954005022"
    3,1,"0.5454703921677284,0.7819383771535038,0.3027939452162367"
    4,1,"0.5365210052235699,0.8975240205792763,0.7613943085507672"

Once loaded, the dataset has one heterogeneous graph with features and
labels:

.. code:: python

    import dgl
    dataset = dgl.data.DGLCSVDataset('./mini_hetero_dataset')
    g = dataset[0]  # only one graph
    print(g)
    # Graph(num_nodes={'item': 5, 'user': 5},
    #       num_edges={('user', 'follow', 'user'): 10, ('user', 'like', 'item'): 10},
    #       metagraph=[('user', 'user', 'follow'), ('user', 'item', 'like')])
    g.nodes['user'].data
    # {'label': tensor([2, 1, 2, 1, 1]), 'feat': tensor([[0.5401, 0.7588, 0.4268],
    #         [0.0868, 0.1145, 0.7197],
    #         [0.8964, 0.2337, 0.8813],
    #         [0.5455, 0.7819, 0.3028],
    #         [0.5365, 0.8975, 0.7614]], dtype=torch.float64)}
    g.edges['like'].data
    # {'label': tensor([1, 2, 2, 0, 1, 2, 1, 0, 2, 1]), 'feat': tensor([[0.7368, 0.1052, 0.9419],
    #         [0.5749, 0.2018, 0.4909],
    #         [0.7697, 0.4940, 0.1086],
    #         [0.1364, 0.1393, 0.7902],
    #         [0.4299, 0.1839, 0.1843],
    #         [0.8613, 0.6799, 0.6580],
    #         [0.6595, 0.2650, 0.7891],
    #         [0.3665, 0.9512, 0.8495],
    #         [0.6986, 0.0386, 0.5564],
    #         [0.5227, 0.3148, 0.4756]], dtype=torch.float64)}
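If your model expects a homogeneous graph, the loaded heterograph can be
flattened with :func:`dgl.to_homogeneous`. This is a minimal sketch assuming
the ``mini_hetero_dataset`` above; note that only features shared by all
types (same name, shape and dtype) can be carried over:

.. code:: python

    import dgl

    dataset = dgl.data.DGLCSVDataset('./mini_hetero_dataset')
    hg = dataset[0]
    # Merge all node/edge types into one graph and keep the 'feat' arrays.
    # g.ndata[dgl.NTYPE] / g.edata[dgl.ETYPE] record the original types.
    g = dgl.to_homogeneous(hg, ndata=['feat'], edata=['feat'])
    print(g.num_nodes())  # 10, i.e., 5 'user' nodes plus 5 'item' nodes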
Dataset of multiple graphs
^^^^^^^^^^^^^^^^^^^^^^^^^^

When the dataset contains multiple graphs (for now, only homogeneous graphs
are supported) with node/edge/graph-level features, the data folder needs
only 4 files: ``meta.yaml`` plus one CSV file each for nodes, edges and
graphs:

.. code::

    ./mini_multi_dataset/
    |-- meta.yaml
    |-- nodes.csv
    |-- edges.csv
    |-- graphs.csv

``meta.yaml``:

.. code:: yaml

    dataset_name: mini_multi_dataset
    edge_data:
    - file_name: edges.csv
    node_data:
    - file_name: nodes.csv
    graph_data:
      file_name: graphs.csv

.. note::
    ``graph_id`` should be specified in the nodes/edges/graphs CSV files;
    otherwise the default value ``0`` is used instead, which may cause
    unexpected or undefined behavior.

``edges.csv``:

.. code::

    graph_id,src_id,dst_id,feat
    0,0,4,"0.39534097273254654,0.9422093637539785,0.634899790318452"
    0,3,0,"0.04486384200747007,0.6453746567017163,0.8757520744192612"
    0,3,2,"0.9397636966928355,0.6526403892728874,0.8643238446466464"
    0,1,1,"0.40559906615287566,0.9848072295736628,0.493888090726854"
    0,4,1,"0.253458867276219,0.9168191778828504,0.47224962583565544"
    0,0,1,"0.3219496197945605,0.3439899477636117,0.7051530741717352"
    0,2,1,"0.692873149428549,0.4770019763881086,0.21937428942781778"
    0,4,0,"0.620118223673067,0.08691420300562658,0.86573472329756"
    0,2,1,"0.00743445923710373,0.5251800239734318,0.054016385555202384"
    0,4,1,"0.6776417760682221,0.7291568018841328,0.4523600060547709"
    1,1,3,"0.6375445528248924,0.04878384701995819,0.4081642382536248"
    1,0,4,"0.776002616178397,0.8851294998284638,0.7321742043493028"
    1,1,0,"0.0928555079874982,0.6156748364694707,0.6985674921582508"
    1,0,2,"0.31328748118329997,0.8326121496142408,0.04133991340612775"
    1,1,0,"0.36786902637778773,0.39161865931662243,0.9971749359397111"
    1,1,1,"0.4647410679872376,0.8478810655406659,0.6746269314422184"
    1,0,2,"0.8117650553546695,0.7893727601272978,0.41527155506593394"
    1,1,3,"0.40707309111756307,0.2796588354307046,0.34846782265758314"
    1,1,0,"0.18626464175355095,0.3523777809254057,0.7863421810531344"
    1,3,0,"0.28357022069634585,0.13774964202156292,0.5913335505943637"

``nodes.csv``:

.. code::

    graph_id,node_id,feat
    0,0,"0.5725330322207948,0.8451870383322376,0.44412796119211184"
    0,1,"0.6624186423087752,0.6118386331195641,0.7352138669985214"
    0,2,"0.7583372765843964,0.15218126307872892,0.6810484348765842"
    0,3,"0.14627522432017592,0.7457985352827006,0.1037097085190507"
    0,4,"0.49037522512771525,0.8778998699783784,0.0911194482288028"
    1,0,"0.11158102039672668,0.08543289788089736,0.6901745368284345"
    1,1,"0.28367647637469273,0.07502571020414439,0.01217200152200748"
    1,2,"0.2472495901894738,0.24285506608575758,0.6494437360242048"
    1,3,"0.5614197853127827,0.059172654879085296,0.4692371689047904"
    1,4,"0.17583413999295983,0.5191278830882644,0.8453123358491914"

``graphs.csv``:

.. code::

    graph_id,feat,label
    0,"0.7426272601929126,0.5197462471155317,0.8149104951283953",0
    1,"0.534822233529295,0.2863627767733977,0.1154897249106891",0

Once loaded, the dataset has multiple homogeneous graphs with features and
labels:

.. code:: python

    import dgl
    dataset = dgl.data.DGLCSVDataset('./mini_multi_dataset')
    print(len(dataset))
    # 2
    graph, label = dataset[0]
    print(graph, label)
    # Graph(num_nodes=5, num_edges=10,
    #       ndata_schemes={'feat': Scheme(shape=(3,), dtype=torch.float64)}
    #       edata_schemes={'feat': Scheme(shape=(3,), dtype=torch.float64)})
    # tensor(0)
    print(dataset.data)
    # {'feat': tensor([[0.7426, 0.5197, 0.8149],
    #         [0.5348, 0.2864, 0.1155]], dtype=torch.float64), 'label': tensor([0, 0])}
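A multi-graph dataset like this can be fed to graph-level training loops
directly. Below is a minimal sketch (assuming a PyTorch backend) that uses
:class:`~dgl.dataloading.GraphDataLoader` to batch the graphs together with
their labels:

.. code:: python

    import dgl
    from dgl.dataloading import GraphDataLoader

    dataset = dgl.data.DGLCSVDataset('./mini_multi_dataset')
    dataloader = GraphDataLoader(dataset, batch_size=2, shuffle=True)
    for batched_graph, labels in dataloader:
        # batched_graph merges the graphs in the batch into one DGLGraph.
        print(batched_graph.num_nodes(), labels)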
YAML Specification
~~~~~~~~~~~~~~~~~~

Example
^^^^^^^

The YAML file below lists all supported keys together, including those that
have default values, though not all of the keys are required for a specific
use case.

.. code:: yaml

    version: 1.0.0
    dataset_name: full_yaml
    separator: ','
    edge_data:
    - file_name: edges_0.csv
      etype:
      - user
      - follow
      - user
      src_id_field: src_id
      dst_id_field: dst_id
    - file_name: edges_1.csv
      etype:
      - user
      - like
      - item
      src_id_field: src_id
      dst_id_field: dst_id
    node_data:
    - file_name: nodes_0.csv
      ntype: user
      node_id_field: node_id
    - file_name: nodes_1.csv
      ntype: item
      node_id_field: node_id
    graph_data:
      file_name: graphs.csv
      graph_id_field: graph_id

Top-level keys
^^^^^^^^^^^^^^

At the top level, only 6 keys are available.

``version``
    Optional. String. It specifies which version of ``meta.yaml`` is used.
    More features may be added and the version will change accordingly.

``dataset_name``
    Required. String. It specifies the dataset name.

``separator``
    Optional. String. It specifies how to parse data in the CSV files.
    Default value: ``,``.

``edge_data``
    Required. List of dict. It includes several sub-keys that help parse
    edges from CSV files.

``node_data``
    Required. List of dict. It includes several sub-keys that help parse
    nodes from CSV files.

``graph_data``
    Optional. Dict. It includes several sub-keys that help parse graph-level
    information from CSV files. It is needed only when the dataset carries
    graph-level data, as in the multiple-graph example above.

Keys for ``edge_data``
^^^^^^^^^^^^^^^^^^^^^^

``file_name``
    Required. String. It specifies the file that stores the edge data.

``etype``
    Optional. List of string. It specifies the canonical edge type.

``src_id_field``
    Optional. String. It specifies which column to read for source node IDs.
    Default value: ``src_id``.

``dst_id_field``
    Optional. String. It specifies which column to read for destination node
    IDs. Default value: ``dst_id``.

Keys for ``node_data``
^^^^^^^^^^^^^^^^^^^^^^

``file_name``
    Required. String. It specifies the file that stores the node data.

``ntype``
    Optional. String. It specifies the node type.

``node_id_field``
    Optional. String. It specifies which column to read for node IDs.
    Default value: ``node_id``.

Keys for ``graph_data``
^^^^^^^^^^^^^^^^^^^^^^^

``file_name``
    Required. String. It specifies the file that stores the graph data.

``graph_id_field``
    Optional. String. It specifies which column to read for graph IDs.
    Default value: ``graph_id``.
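Since a malformed ``meta.yaml`` is a common source of loading errors, it can
be worth sanity-checking the file before constructing the dataset. Below is
a minimal sketch using PyYAML (an assumption; any YAML parser works):

.. code:: python

    import yaml

    with open('/path/to/dataset/meta.yaml') as f:
        meta = yaml.safe_load(f)
    # 'dataset_name', 'edge_data' and 'node_data' are required keys.
    for key in ('dataset_name', 'edge_data', 'node_data'):
        assert key in meta, f"meta.yaml is missing required key: {key}"
    print(meta.get('separator', ','))  # ',' is the default separator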
Parse node/edge/graph data on your own
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

By default, all the data are attached to ``g.ndata`` under the same keys as
the column names in ``nodes.csv``, except for ``node_id``; the same applies
to the data in ``edges.csv``. Data is auto-formatted via ``pandas`` unless
it is a string of float values (feature data is often of this format). For
finer control, you can define your own node/edge/graph data parser: a
callable that accepts a ``pandas.DataFrame`` as input data. Pass such a
callable instance when instantiating ``DGLCSVDataset``. Below is an example.

``SelfDefinedDataParser``:

.. code:: python

    import numpy as np
    import pandas as pd

    class SelfDefinedDataParser:
        """Convert labels which are in string format into numeric values."""
        def __call__(self, df: pd.DataFrame):
            data = {}
            for header in df:
                if 'Unnamed' in header:
                    print("Unnamed column is found. Ignored...")
                    continue
                dt = df[header].to_numpy().squeeze()
                if header == 'label':
                    # Map string labels to numeric values.
                    dt = np.array([1 if e == 'positive' else 0 for e in dt])
                data[header] = dt
            return data

Example: ``customized_parser_dataset``:

.. code::

    ./customized_parser_dataset/
    |-- meta.yaml
    |-- nodes.csv
    |-- edges.csv

``meta.yaml``:

.. code:: yaml

    dataset_name: customized_parser_dataset
    edge_data:
    - file_name: edges.csv
    node_data:
    - file_name: nodes.csv

``edges.csv``:

.. code::

    src_id,dst_id,label
    4,0,positive
    4,0,negative
    0,3,positive
    0,1,positive
    0,2,negative
    0,0,positive
    2,2,negative
    1,0,positive
    3,0,negative
    4,0,positive

``nodes.csv``:

.. code::

    node_id,label
    0,positive
    1,negative
    2,positive
    3,negative
    4,positive

Once loaded, the dataset has one graph whose labels are parsed by the
customized parser:

.. code:: python

    import dgl
    dataset = dgl.data.DGLCSVDataset(
        './customized_parser_dataset',
        node_data_parser={'_V': SelfDefinedDataParser()},
        edge_data_parser={('_V', '_E', '_V'): SelfDefinedDataParser()})
    print(dataset[0].ndata['label'])
    # tensor([1, 0, 1, 0, 1])
    print(dataset[0].edata['label'])
    # tensor([1, 0, 1, 1, 0, 1, 0, 1, 0, 1])

FAQs
~~~~

What's the data type in CSV files?
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

A default data parser is used for parsing the node/edge/graph CSV files; it
infers data types automatically. ID-related data such as ``node_id``,
``src_id``, ``dst_id`` and ``graph_id`` is required to be numeric, as these
fields are used for constructing the graph. Any other data is attached to
``g.ndata`` or ``g.edata`` directly, so it is the user's responsibility to
make sure the data type is the expected one when using it within the graph.
In particular, string data composed of float values is split and cast into a
float array by the default data parser.

What if some lines in CSV have missing values in several fields?
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

It's undefined behavior. Please make sure the data is complete.

What if ``graph_id`` is not specified in CSV?
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

For a single graph, such a field in ``edge_data`` and ``node_data`` is not
used at all, so it is fine to omit it. For multiple graphs, ``graph_id``
should be provided; otherwise, all edge/node data will be regarded as
belonging to ``graph_id = 0``, which is usually not what you expect. If your
existing CSV files lack this column, see the sketch below for one way to add
it.
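A minimal sketch of adding a ``graph_id`` column with ``pandas`` (this
assumes every row belongs to a single graph with ID ``0``; adapt the values
if your rows span multiple graphs):

.. code:: python

    import pandas as pd

    for name in ('nodes.csv', 'edges.csv'):
        df = pd.read_csv(name)
        df.insert(0, 'graph_id', 0)  # every row belongs to graph 0
        df.to_csv(name, index=False)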