
.. _guide-data-pipeline-loadcsv:

4.6 Loading data from CSV files
----------------------------------------------

Comma-separated values (CSV) is a widely used data storage format. DGL provides
:class:`~dgl.data.CSVDataset` for loading and parsing graph data stored in
CSV format.

To create a ``CSVDataset`` object:

.. code:: python

    import dgl
    ds = dgl.data.CSVDataset('/path/to/dataset')

The returned ``ds`` object is a standard :class:`~dgl.data.DGLDataset`. For
example, one can get graph samples using ``__getitem__`` and then read the
node/edge features of each graph from ``ndata``/``edata``.

.. code:: python

    # A demonstration of how to use the loaded dataset. The feature names
    # may vary depending on the CSV contents.
    g = ds[0] # get the graph
    label = g.ndata['label']
    feat = g.ndata['feat']

Data folder structure
~~~~~~~~~~~~~~~~~~~~~

.. code::

    /path/to/dataset/
    |-- meta.yaml     # metadata of the dataset
    |-- edges_0.csv   # edge data including src_id, dst_id, feature, label and so on
    |-- ...           # you can have as many CSVs for edge data as you want
    |-- nodes_0.csv   # node data including node_id, feature, label and so on
    |-- ...           # you can have as many CSVs for node data as you want
    |-- graphs.csv    # graph-level features

Node-, edge-, and graph-level data are stored in CSV files. ``meta.yaml`` is a metadata file specifying
where to read the node, edge, and graph data and how to parse them to construct the dataset
object. A minimal data folder contains one ``meta.yaml`` and two CSVs, one for node data and one
for edge data, in which case the dataset contains only a single graph with no graph-level data.
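
As a rough illustration of such a minimal folder, the sketch below writes the three files using only
the Python standard library. The helper name ``write_minimal_dataset`` and the example edges are
hypothetical and not part of DGL; ``meta.yaml`` is written as plain text so that no YAML library is
assumed.

.. code:: python

    import csv
    import os
    import tempfile

    def write_minimal_dataset(root, name, node_ids, edges):
        """Write meta.yaml, nodes.csv and edges.csv for one featureless graph."""
        os.makedirs(root, exist_ok=True)
        # Plain-text meta.yaml with one edge file and one node file.
        meta = (
            f"dataset_name: {name}\n"
            "edge_data:\n"
            "- file_name: edges.csv\n"
            "node_data:\n"
            "- file_name: nodes.csv\n"
        )
        with open(os.path.join(root, "meta.yaml"), "w") as f:
            f.write(meta)
        with open(os.path.join(root, "nodes.csv"), "w", newline="") as f:
            writer = csv.writer(f, lineterminator="\n")
            writer.writerow(["node_id"])
            writer.writerows([n] for n in node_ids)
        with open(os.path.join(root, "edges.csv"), "w", newline="") as f:
            writer = csv.writer(f, lineterminator="\n")
            writer.writerow(["src_id", "dst_id"])
            writer.writerows(edges)
        return root

    root = write_minimal_dataset(
        os.path.join(tempfile.mkdtemp(), "mini_dataset"),
        "mini_dataset",
        node_ids=range(3),
        edges=[(0, 1), (1, 2), (2, 0)],
    )
    files = sorted(os.listdir(root))
    print(files)  # ['edges.csv', 'meta.yaml', 'nodes.csv']

The resulting folder path is what would then be passed to ``CSVDataset``.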

Dataset of a single feature-less graph
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

When the dataset contains only one graph with no node or edge features, the data folder needs
only three files: ``meta.yaml``, one CSV for node IDs and one CSV for edges:

.. code::

    ./mini_featureless_dataset/
    |-- meta.yaml
    |-- nodes.csv
    |-- edges.csv

``meta.yaml`` contains the following information:

.. code:: yaml

    dataset_name: mini_featureless_dataset
    edge_data:
    - file_name: edges.csv
    node_data:
    - file_name: nodes.csv

``nodes.csv`` lists the node IDs under the ``node_id`` field:

.. code::

    node_id
    0
    1
    2
    3
    4

``edges.csv`` lists all the edges in two columns (``src_id`` and ``dst_id``) specifying the
source and destination node ID of each edge:

.. code::

    src_id,dst_id
    4,4
    4,1
    3,0
    4,1
    4,0
    1,2
    1,3
    3,3
    1,1
    4,1

Once loaded, the dataset has one graph without any features:

.. code:: python

    >>> import dgl
    >>> dataset = dgl.data.CSVDataset('./mini_featureless_dataset')
    >>> g = dataset[0]  # only one graph
    >>> print(g)
    Graph(num_nodes=5, num_edges=10,
          ndata_schemes={}
          edata_schemes={})

.. note::
    Non-integer node IDs are allowed. When constructing the graph, ``CSVDataset`` will
    map each raw ID to an integer ID starting from zero.
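
The remapping described in the note above can be pictured with plain Python. The sketch below is only
an illustration: it assumes IDs are numbered in order of first appearance, which may differ from the
order ``CSVDataset`` actually uses, and the helper ``remap_ids`` is hypothetical.

.. code:: python

    def remap_ids(raw_ids):
        """Assign consecutive integers, starting from zero, to raw IDs."""
        mapping = {}
        for raw in raw_ids:
            # setdefault keeps the integer already assigned to a repeated ID.
            mapping.setdefault(raw, len(mapping))
        return mapping

    mapping = remap_ids(["alice", "bob", "carol", "bob"])
    edges = [("alice", "bob"), ("carol", "alice")]
    int_edges = [(mapping[s], mapping[d]) for s, d in edges]
    print(mapping)    # {'alice': 0, 'bob': 1, 'carol': 2}
    print(int_edges)  # [(0, 1), (2, 0)]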

.. note::
    Edges are always directed. To have both directions, add reversed edges in the edge
    CSV file or use :class:`~dgl.transforms.AddReverse` to transform the loaded graph.
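
The first option in the note above, adding reversed edges to the edge CSV file, could be done with a
small preprocessing step. The sketch below uses only the standard library; the helper
``add_reverse_edges`` is hypothetical, and skipping self-loops is a design choice made here, not
something the note prescribes. Any extra columns are copied unchanged to the reversed edge.

.. code:: python

    import csv
    import io

    def add_reverse_edges(csv_text):
        """Return CSV text in which every non-self-loop edge also appears reversed."""
        reader = csv.DictReader(io.StringIO(csv_text))
        out = io.StringIO()
        writer = csv.DictWriter(out, fieldnames=reader.fieldnames, lineterminator="\n")
        writer.writeheader()
        for row in reader:
            writer.writerow(row)
            if row["src_id"] != row["dst_id"]:  # a self-loop is its own reverse
                writer.writerow(dict(row, src_id=row["dst_id"], dst_id=row["src_id"]))
        return out.getvalue()

    result = add_reverse_edges("src_id,dst_id\n4,1\n3,3\n1,2\n")
    print(result)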


A graph without any features is rarely of much interest. The next example shows
how to load and parse node and edge features.

Dataset of a single graph with features and labels
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

When the dataset contains a single graph with node or edge features and labels, the data folder
still needs only three files: ``meta.yaml``, one CSV for node data and one CSV for edge data:

.. code::

    ./mini_feature_dataset/
    |-- meta.yaml
    |-- nodes.csv
    |-- edges.csv

``meta.yaml``:

.. code:: yaml

    dataset_name: mini_feature_dataset
    edge_data:
    - file_name: edges.csv
    node_data:
    - file_name: nodes.csv

``edges.csv`` contains five synthetic edge data columns (``label``, ``train_mask``, ``val_mask``, ``test_mask``, ``feat``):

.. code::

    src_id,dst_id,label,train_mask,val_mask,test_mask,feat
    4,0,2,False,True,True,"0.5477868606453535, 0.4470617033458436, 0.936706701616337"
    4,0,0,False,False,True,"0.9794634290792008, 0.23682038840665198, 0.049629338970987646"
    0,3,1,True,True,True,"0.8586722047523594, 0.5746912787380253, 0.6462162561249654"
    0,1,2,True,False,False,"0.2730008213674695, 0.5937484188166621, 0.765544096939567"
    0,2,1,True,True,True,"0.45441619816038514, 0.1681403185591509, 0.9952376085297715"
    0,0,0,False,False,False,"0.4197669213305396, 0.849983324532477, 0.16974127573016262"
    2,2,1,False,True,True,"0.5495035052928215, 0.21394654203489705, 0.7174910641836348"
    1,0,2,False,True,False,"0.008790817766266334, 0.4216530595907526, 0.529195480661293"
    3,0,0,True,True,True,"0.6598715708878852, 0.1932390907048961, 0.9774471538377553"
    4,0,1,False,False,False,"0.16846068931179736, 0.41516080644186737, 0.002158116134429955"


``nodes.csv`` contains five synthetic node data columns (``label``, ``train_mask``, ``val_mask``, ``test_mask``, ``feat``):

.. code::

    node_id,label,train_mask,val_mask,test_mask,feat
    0,1,False,True,True,"0.07816474278491703, 0.9137336384979067, 0.4654086994009452"
    1,1,True,True,True,"0.05354099924658973, 0.8753101998792645, 0.33929432608774135"
    2,1,True,False,True,"0.33234211884156384, 0.9370522452510665, 0.6694943496824788"
    3,0,False,True,False,"0.9784264442230887, 0.22131880861864428, 0.3161154827254189"
    4,1,True,True,False,"0.23142237259162102, 0.8715767748481147, 0.19117861103555467"

Once loaded, the dataset has one graph. Node/edge features are stored in ``ndata`` and ``edata``
under the same names as the CSV columns. The example also demonstrates how to specify a
vector-shaped feature: as a comma-separated list enclosed in double quotes ``"..."``.
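
The double quotes matter because they make a CSV reader treat the whole list as a single cell. The
sketch below illustrates this with the standard ``csv`` module on made-up rows; it is not how
``CSVDataset`` is implemented, and the helper ``parse_vector`` is hypothetical.

.. code:: python

    import csv
    import io

    csv_text = (
        "node_id,label,feat\n"
        '0,1,"0.1, 0.2, 0.3"\n'
        '1,0,"0.4, 0.5, 0.6"\n'
    )

    def parse_vector(cell):
        """Split a quoted, comma-separated cell into a list of floats."""
        return [float(x) for x in cell.split(",")]

    rows = list(csv.DictReader(io.StringIO(csv_text)))
    feats = [parse_vector(row["feat"]) for row in rows]
    print(feats)  # [[0.1, 0.2, 0.3], [0.4, 0.5, 0.6]]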

.. code:: python

    >>> import dgl
    >>> dataset = dgl.data.CSVDataset('./mini_feature_dataset')
    >>> g = dataset[0]  # only one graph
    >>> print(g)
    Graph(num_nodes=5, num_edges=10,
          ndata_schemes={'label': Scheme(shape=(), dtype=torch.int64), 'train_mask': Scheme(shape=(), dtype=torch.bool), 'val_mask': Scheme(shape=(), dtype=torch.bool), 'test_mask': Scheme(shape=(), dtype=torch.bool), 'feat': Scheme(shape=(3,), dtype=torch.float64)}
          edata_schemes={'label': Scheme(shape=(), dtype=torch.int64), 'train_mask': Scheme(shape=(), dtype=torch.bool), 'val_mask': Scheme(shape=(), dtype=torch.bool), 'test_mask': Scheme(shape=(), dtype=torch.bool), 'feat': Scheme(shape=(3,), dtype=torch.float64)})

.. note::
    By default, ``CSVDataset`` assumes all feature data to be numerical values (e.g., int, float, bool or
    list) and does not allow missing values. Users can provide a custom data parser for other cases.
    See `Custom Data Parser`_ for more details.

Dataset of a single heterogeneous graph
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

One can specify multiple node and edge CSV files (one per type) to represent a heterogeneous graph.
Here is an example dataset with two node types and two edge types:

.. code::

    ./mini_hetero_dataset/
    |-- meta.yaml
    |-- nodes_0.csv
    |-- nodes_1.csv
    |-- edges_0.csv
    |-- edges_1.csv

The ``meta.yaml`` specifies the node type name (using ``ntype``) and edge type name (using ``etype``)
of each CSV file. The edge type name is a string triplet containing the source node type name, relation
name and the destination node type name.

.. code:: yaml

    dataset_name: mini_hetero_dataset
    edge_data:
    - file_name: edges_0.csv
      etype: [user, follow, user]
    - file_name: edges_1.csv
      etype: [user, like, item]
    node_data:
    - file_name: nodes_0.csv
      ntype: user
    - file_name: nodes_1.csv
      ntype: item
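
Since every edge type names its source and destination node types, a quick consistency check before
loading can catch typos. The sketch below works on a plain dictionary that mirrors the ``meta.yaml``
above (so no YAML library is assumed); the helper ``undeclared_node_types`` is hypothetical and not
part of DGL.

.. code:: python

    meta = {
        "dataset_name": "mini_hetero_dataset",
        "edge_data": [
            {"file_name": "edges_0.csv", "etype": ["user", "follow", "user"]},
            {"file_name": "edges_1.csv", "etype": ["user", "like", "item"]},
        ],
        "node_data": [
            {"file_name": "nodes_0.csv", "ntype": "user"},
            {"file_name": "nodes_1.csv", "ntype": "item"},
        ],
    }

    def undeclared_node_types(meta):
        """Return node types used by some etype but missing from node_data."""
        declared = {entry["ntype"] for entry in meta["node_data"]}
        used = set()
        for entry in meta["edge_data"]:
            src_type, _relation, dst_type = entry["etype"]
            used.update((src_type, dst_type))
        return sorted(used - declared)

    print(undeclared_node_types(meta))  # []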

The node and edge CSV files follow the same format as in homogeneous graphs. Here are some synthetic
data for demonstration purposes:

``edges_0.csv`` and ``edges_1.csv`` (both files have the same content in this example):

.. code::

    src_id,dst_id,label,feat
    4,4,1,"0.736833152378035,0.10522806046048205,0.9418796835016118"
    3,4,2,"0.5749339182767451,0.20181320245665535,0.490938012147181"
    1,4,2,"0.7697294432580938,0.49397782380750765,0.10864079337442234"
    0,4,0,"0.1364240150959487,0.1393107840629273,0.7901988878812207"
    2,3,1,"0.42988138237505735,0.18389137408509248,0.18431292077750894"
    0,4,2,"0.8613368738351794,0.67985810014162,0.6580438064356824"
    2,4,1,"0.6594951663841697,0.26499036865016423,0.7891429392727503"
    4,1,0,"0.36649684241348557,0.9511783938523962,0.8494919263589972"
    1,1,2,"0.698592283371875,0.038622249776255946,0.5563827995742111"
    0,4,1,"0.5227112950269823,0.3148264185956532,0.47562693094002173"

``nodes_0.csv`` and ``nodes_1.csv`` (both files have the same content in this example):

.. code::

    node_id,label,feat
    0,2,"0.5400687466285844,0.7588441197954202,0.4268254673041745"
    1,1,"0.08680051341900807,0.11446843700743892,0.7196969604886617"
    2,2,"0.8964389655603473,0.23368113896545695,0.8813472954005022"
    3,1,"0.5454703921677284,0.7819383771535038,0.3027939452162367"
    4,1,"0.5365210052235699,0.8975240205792763,0.7613943085507672"

Once loaded, the dataset has one heterogeneous graph with features and labels:

.. code:: python

    >>> import dgl
    >>> dataset = dgl.data.CSVDataset('./mini_hetero_dataset')
    >>> g = dataset[0]  # only one graph
    >>> print(g)
    Graph(num_nodes={'item': 5, 'user': 5},
          num_edges={('user', 'follow', 'user'): 10, ('user', 'like', 'item'): 10},
          metagraph=[('user', 'user', 'follow'), ('user', 'item', 'like')])
    >>> g.nodes['user'].data
    {'label': tensor([2, 1, 2, 1, 1]), 'feat': tensor([[0.5401, 0.7588, 0.4268],
            [0.0868, 0.1145, 0.7197],
            [0.8964, 0.2337, 0.8813],
            [0.5455, 0.7819, 0.3028],
            [0.5365, 0.8975, 0.7614]], dtype=torch.float64)}
    >>> g.edges['like'].data
    {'label': tensor([1, 2, 2, 0, 1, 2, 1, 0, 2, 1]), 'feat': tensor([[0.7368, 0.1052, 0.9419],
            [0.5749, 0.2018, 0.4909],
            [0.7697, 0.4940, 0.1086],
            [0.1364, 0.1393, 0.7902],
            [0.4299, 0.1839, 0.1843],
            [0.8613, 0.6799, 0.6580],
            [0.6595, 0.2650, 0.7891],
            [0.3665, 0.9512, 0.8495],
            [0.6986, 0.0386, 0.5564],
            [0.5227, 0.3148, 0.4756]], dtype=torch.float64)}

Dataset of multiple graphs
~~~~~~~~~~~~~~~~~~~~~~~~~~

When there are multiple graphs, one can include an additional CSV file for storing graph-level features.
Here is an example:

.. code::

    ./mini_multi_dataset/
    |-- meta.yaml
    |-- nodes.csv
    |-- edges.csv
    |-- graphs.csv

Accordingly, the ``meta.yaml`` should include an extra ``graph_data`` key to tell which CSV file to
load graph-level features from.

.. code:: yaml

    dataset_name: mini_multi_dataset
    edge_data:
    - file_name: edges.csv
    node_data:
    - file_name: nodes.csv
    graph_data:
      file_name: graphs.csv

To distinguish nodes and edges of different graphs, ``nodes.csv`` and ``edges.csv`` must contain
an extra column ``graph_id``:

``edges.csv``:

.. code::

    graph_id,src_id,dst_id,feat
    0,0,4,"0.39534097273254654,0.9422093637539785,0.634899790318452"
    0,3,0,"0.04486384200747007,0.6453746567017163,0.8757520744192612"
    0,3,2,"0.9397636966928355,0.6526403892728874,0.8643238446466464"
    0,1,1,"0.40559906615287566,0.9848072295736628,0.493888090726854"
    0,4,1,"0.253458867276219,0.9168191778828504,0.47224962583565544"
    0,0,1,"0.3219496197945605,0.3439899477636117,0.7051530741717352"
    0,2,1,"0.692873149428549,0.4770019763881086,0.21937428942781778"
    0,4,0,"0.620118223673067,0.08691420300562658,0.86573472329756"
    0,2,1,"0.00743445923710373,0.5251800239734318,0.054016385555202384"
    0,4,1,"0.6776417760682221,0.7291568018841328,0.4523600060547709"
    1,1,3,"0.6375445528248924,0.04878384701995819,0.4081642382536248"
    1,0,4,"0.776002616178397,0.8851294998284638,0.7321742043493028"
    1,1,0,"0.0928555079874982,0.6156748364694707,0.6985674921582508"
    1,0,2,"0.31328748118329997,0.8326121496142408,0.04133991340612775"
    1,1,0,"0.36786902637778773,0.39161865931662243,0.9971749359397111"
    1,1,1,"0.4647410679872376,0.8478810655406659,0.6746269314422184"
    1,0,2,"0.8117650553546695,0.7893727601272978,0.41527155506593394"
    1,1,3,"0.40707309111756307,0.2796588354307046,0.34846782265758314"
    1,1,0,"0.18626464175355095,0.3523777809254057,0.7863421810531344"
    1,3,0,"0.28357022069634585,0.13774964202156292,0.5913335505943637"

``nodes.csv``:

.. code::

    graph_id,node_id,feat
    0,0,"0.5725330322207948,0.8451870383322376,0.44412796119211184"
    0,1,"0.6624186423087752,0.6118386331195641,0.7352138669985214"
    0,2,"0.7583372765843964,0.15218126307872892,0.6810484348765842"
    0,3,"0.14627522432017592,0.7457985352827006,0.1037097085190507"
    0,4,"0.49037522512771525,0.8778998699783784,0.0911194482288028"
    1,0,"0.11158102039672668,0.08543289788089736,0.6901745368284345"
    1,1,"0.28367647637469273,0.07502571020414439,0.01217200152200748"
    1,2,"0.2472495901894738,0.24285506608575758,0.6494437360242048"
    1,3,"0.5614197853127827,0.059172654879085296,0.4692371689047904"
    1,4,"0.17583413999295983,0.5191278830882644,0.8453123358491914"
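
Conceptually, the ``graph_id`` column partitions the rows of each file into per-graph groups. The
sketch below shows that grouping on a shortened, made-up edge table using only the standard library;
it illustrates the idea rather than the loader's actual code, and the helper
``group_edges_by_graph`` is hypothetical.

.. code:: python

    import csv
    import io
    from collections import defaultdict

    edges_csv = (
        "graph_id,src_id,dst_id\n"
        "0,0,4\n"
        "0,3,0\n"
        "1,1,3\n"
        "1,0,4\n"
        "1,1,0\n"
    )

    def group_edges_by_graph(csv_text):
        """Map each graph_id to the list of (src_id, dst_id) pairs it owns."""
        groups = defaultdict(list)
        for row in csv.DictReader(io.StringIO(csv_text)):
            pair = (int(row["src_id"]), int(row["dst_id"]))
            groups[int(row["graph_id"])].append(pair)
        return dict(groups)

    groups = group_edges_by_graph(edges_csv)
    print({gid: len(pairs) for gid, pairs in groups.items()})  # {0: 2, 1: 3}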

``graphs.csv`` contains a ``graph_id`` column and an arbitrary number of feature columns.
The example dataset here has two graphs, each with two graph-level data columns, ``feat`` and
``label``.

.. code::

    graph_id,feat,label
    0,"0.7426272601929126,0.5197462471155317,0.8149104951283953",0
    1,"0.534822233529295,0.2863627767733977,0.1154897249106891",0

Once loaded, the dataset has multiple homogeneous graphs with features and labels:

.. code:: python

    >>> import dgl
    >>> dataset = dgl.data.CSVDataset('./mini_multi_dataset')
    >>> print(len(dataset))
    2
    >>> graph0, data0 = dataset[0]
    >>> print(graph0)
    Graph(num_nodes=5, num_edges=10,
          ndata_schemes={'feat': Scheme(shape=(3,), dtype=torch.float64)}
          edata_schemes={'feat': Scheme(shape=(3,), dtype=torch.float64)})
    >>> print(data0)
    {'feat': tensor([0.7426, 0.5197, 0.8149]), 'label': tensor([0])}
    >>> graph1, data1 = dataset[1]
    >>> print(graph1)
    Graph(num_nodes=5, num_edges=10,
          ndata_schemes={'feat': Scheme(shape=(3,), dtype=torch.float64)}
          edata_schemes={'feat': Scheme(shape=(3,), dtype=torch.float64)})
    >>> print(data1)
    {'feat': tensor([0.5348, 0.2864, 0.1155]), 'label': tensor([0])}

.. note::

    When there are multiple graphs, ``CSVDataset`` currently requires them to be homogeneous.


Custom Data Parser
~~~~~~~~~~~~~~~~~~

By default, ``CSVDataset`` assumes that all the stored node-, edge-, and graph-level data are numerical
values. Users can provide a custom ``DataParser`` to ``CSVDataset`` to handle more complex
data types. A ``DataParser`` needs to implement the ``__call__`` method, which takes in the
:class:`pandas.DataFrame` object created from the CSV file and returns a dictionary of
parsed feature data. The parsed feature data will be saved to the ``ndata`` and ``edata`` of
the corresponding ``DGLGraph`` object, and thus must be tensors or numpy arrays. The example
below defines a ``DataParser`` that converts string labels to integers.

Consider the following dataset:

.. code::

    ./customized_parser_dataset/
    |-- meta.yaml
    |-- nodes.csv
    |-- edges.csv

``meta.yaml``:

.. code:: yaml

    dataset_name: customized_parser_dataset
    edge_data:
    - file_name: edges.csv
    node_data:
    - file_name: nodes.csv

``edges.csv``:

.. code::

    src_id,dst_id,label
    4,0,positive
    4,0,negative
    0,3,positive
    0,1,positive
    0,2,negative
    0,0,positive
    2,2,negative
    1,0,positive
    3,0,negative
    4,0,positive

``nodes.csv``:

.. code::

    node_id,label
    0,positive
    1,negative
    2,positive
    3,negative
    4,positive

To parse the string labels, one can define a ``DataParser`` class as follows:

.. code:: python

    import numpy as np
    import pandas as pd

    class MyDataParser:
        def __call__(self, df: pd.DataFrame):
            parsed = {}
            for header in df:
                if 'Unnamed' in header:  # Handle Unnamed column
                    print("Unnamed column is found. Ignored...")
                    continue
                dt = df[header].to_numpy().squeeze()
                if header == 'label':
                    dt = np.array([1 if e == 'positive' else 0 for e in dt])
                parsed[header] = dt
            return parsed

Create a ``CSVDataset`` using the defined ``DataParser``:

.. code:: python

    >>> import dgl
    >>> dataset = dgl.data.CSVDataset('./customized_parser_dataset',
    ...                               ndata_parser=MyDataParser(),
    ...                               edata_parser=MyDataParser())
    >>> print(dataset[0].ndata['label'])
    tensor([1, 0, 1, 0, 1])
    >>> print(dataset[0].edata['label'])
    tensor([1, 0, 1, 1, 0, 1, 0, 1, 0, 1])
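
``MyDataParser`` hard-codes a binary mapping for ``positive``. For labels with more than two classes,
the mapping could instead be built from the data. The sketch below shows one way to do that in plain
Python; the helper ``encode_labels`` and the ``neutral`` class are hypothetical, and inside a
``__call__`` method the encoded list would still need to be wrapped in a numpy array.

.. code:: python

    def encode_labels(values):
        """Map string labels to integer classes in order of first appearance."""
        classes = {}
        encoded = [classes.setdefault(v, len(classes)) for v in values]
        return encoded, classes

    encoded, classes = encode_labels(["positive", "negative", "neutral", "positive"])
    print(encoded)  # [0, 1, 2, 0]
    print(classes)  # {'positive': 0, 'negative': 1, 'neutral': 2}

Note that a mapping built this way depends on row order, so node and edge parsers would each derive
their own mapping unless one is shared between them.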

.. note::

    To use a different ``DataParser`` for each node/edge type, pass a dictionary to
    ``ndata_parser`` and ``edata_parser``, where each key is a type name (a single string for
    a node type; a string triplet for an edge type) and each value is the ``DataParser`` to use.


Full YAML Specification
~~~~~~~~~~~~~~~~~~~~~~~

``CSVDataset`` allows more flexible control over the loading and parsing process. For example, one
can change the ID column names via ``meta.yaml``. The example below lists all the supported keys.

.. code:: yaml

    version: 1.0.0
    dataset_name: some_complex_data
    separator: ','                   # CSV separator symbol. Default: ','
    edge_data:
    - file_name: edges_0.csv
      etype: [user, follow, user]
      src_id_field: src_id           # Column name for source node IDs. Default: src_id
      dst_id_field: dst_id           # Column name for destination node IDs. Default: dst_id
    - file_name: edges_1.csv
      etype: [user, like, item]
      src_id_field: src_id
      dst_id_field: dst_id
    node_data:
    - file_name: nodes_0.csv
      ntype: user
      node_id_field: node_id         # Column name for node IDs. Default: node_id
    - file_name: nodes_1.csv
      ntype: item
      node_id_field: node_id         # Column name for node IDs. Default: node_id
    graph_data:
      file_name: graphs.csv
      graph_id_field: graph_id       # Column name for graph IDs. Default: graph_id

Top-level
^^^^^^^^^^^^^^

At the top level, six keys are available:

  - ``version``: Optional. String.
    It specifies which version of ``meta.yaml`` is used. More features may be added in the future.
  - ``dataset_name``: Required. String.
    It specifies the dataset name.
  - ``separator``: Optional. String.
    It specifies the separator used to parse the CSV files. Default: ``','``.
  - ``edge_data``: Required. List of ``EdgeData``.
    Metadata for parsing edge CSV files.
  - ``node_data``: Required. List of ``NodeData``.
    Metadata for parsing node CSV files.
  - ``graph_data``: Optional. ``GraphData``.
    Metadata for parsing the graph CSV file.

``EdgeData``
^^^^^^^^^^^^^^^^^^^^^^

There are 4 keys:

  - ``file_name``: Required. String.
    The CSV file to load data from.
  - ``etype``: Optional. List of strings.
    Edge type name as a string triplet: [source node type, relation type, destination node type].
  - ``src_id_field``: Optional. String.
    Which column to read for source node IDs. Default: ``src_id``.
  - ``dst_id_field``: Optional. String.
    Which column to read for destination node IDs. Default: ``dst_id``.

``NodeData``
^^^^^^^^^^^^^^^^^^^^^^

There are 3 keys:

  - ``file_name``: Required. String.
    The CSV file to load data from.
  - ``ntype``: Optional. String.
    Node type name.
  - ``node_id_field``: Optional. String.
    Which column to read for node IDs. Default: ``node_id``.

``GraphData``
^^^^^^^^^^^^^^^^^^^^^^

There are 2 keys:

  - ``file_name``: Required. String.
    The CSV file to load data from.
  - ``graph_id_field``: Optional. String.
    Which column to read for graph IDs. Default: ``graph_id``.
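
The required keys and defaults listed above can be summarized as a small check over a dictionary that
mirrors ``meta.yaml``. The sketch below is an illustration only: the helper ``apply_defaults`` is
hypothetical, it does not reflect how ``CSVDataset`` validates its input, and it leaves ``version``,
``ntype`` and ``etype`` untouched because no defaults are documented for them here.

.. code:: python

    import copy

    def apply_defaults(meta):
        """Check required keys and fill in the documented default values."""
        meta = copy.deepcopy(meta)
        for key in ("dataset_name", "edge_data", "node_data"):
            if key not in meta:
                raise ValueError(f"meta.yaml is missing required key: {key}")
        meta.setdefault("separator", ",")
        for entry in meta["edge_data"]:
            if "file_name" not in entry:
                raise ValueError("every edge_data entry needs file_name")
            entry.setdefault("src_id_field", "src_id")
            entry.setdefault("dst_id_field", "dst_id")
        for entry in meta["node_data"]:
            if "file_name" not in entry:
                raise ValueError("every node_data entry needs file_name")
            entry.setdefault("node_id_field", "node_id")
        if "graph_data" in meta:
            if "file_name" not in meta["graph_data"]:
                raise ValueError("graph_data needs file_name")
            meta["graph_data"].setdefault("graph_id_field", "graph_id")
        return meta

    meta = apply_defaults({
        "dataset_name": "mini_multi_dataset",
        "edge_data": [{"file_name": "edges.csv"}],
        "node_data": [{"file_name": "nodes.csv"}],
        "graph_data": {"file_name": "graphs.csv"},
    })
    print(meta["edge_data"][0]["src_id_field"])  # src_id
    print(meta["graph_data"]["graph_id_field"])  # graph_id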