.. _guide-data-pipeline-loadcsv:

4.6 Loading data from CSV files
----------------------------------------------

Comma-separated values (CSV) is a widely used data storage format. DGL provides
:class:`~dgl.data.CSVDataset` for loading and parsing graph data stored in
CSV format.

To create a ``CSVDataset`` object:

.. code:: python

    import dgl
    ds = dgl.data.CSVDataset('/path/to/dataset')

The returned ``ds`` object is a standard :class:`~dgl.data.DGLDataset`. For
example, one can get graph samples using ``__getitem__`` as well as node/edge
features using ``ndata``/``edata``.

.. code:: python

    # A demonstration of how to use the loaded dataset. The feature names
    # may vary depending on the CSV contents.
    g = ds[0] # get the graph
    label = g.ndata['label']
    feat = g.ndata['feat']

Data folder structure
~~~~~~~~~~~~~~~~~~~~~

.. code::

    /path/to/dataset/
    |-- meta.yaml     # metadata of the dataset
    |-- edges_0.csv   # edge data including src_id, dst_id, feature, label and so on
    |-- ...           # you can have as many CSVs for edge data as you want
    |-- nodes_0.csv   # node data including node_id, feature, label and so on
    |-- ...           # you can have as many CSVs for node data as you want
    |-- graphs.csv    # graph-level features

Node/edge/graph-level data are stored in CSV files. ``meta.yaml`` is a metadata file specifying
where to read nodes/edges/graphs data and how to parse them to construct the dataset
object. A minimal data folder contains one ``meta.yaml`` and two CSVs, one for node data and one
for edge data, in which case the dataset contains only a single graph with no graph-level data.
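
As a rough sketch of this minimal layout, the script below writes a ``meta.yaml`` and two CSVs
into a temporary folder using only the Python standard library. The folder name ``mini_dataset``
and the helper ``write_minimal_dataset`` are illustrative names, not part of DGL. The resulting
path is the kind of folder one would pass to ``dgl.data.CSVDataset``.

.. code:: python

    import tempfile
    from pathlib import Path

    def write_minimal_dataset(root: Path) -> Path:
        """Write meta.yaml, nodes.csv and edges.csv under root/mini_dataset."""
        folder = root / "mini_dataset"  # hypothetical folder name
        folder.mkdir(parents=True, exist_ok=True)
        # meta.yaml names the CSV files that hold edge and node data.
        (folder / "meta.yaml").write_text(
            "dataset_name: mini_dataset\n"
            "edge_data:\n"
            "- file_name: edges.csv\n"
            "node_data:\n"
            "- file_name: nodes.csv\n"
        )
        (folder / "nodes.csv").write_text("node_id\n0\n1\n2\n")
        (folder / "edges.csv").write_text("src_id,dst_id\n0,1\n1,2\n")
        return folder

    folder = write_minimal_dataset(Path(tempfile.mkdtemp()))
    print(sorted(p.name for p in folder.iterdir()))
    # ['edges.csv', 'meta.yaml', 'nodes.csv']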

Dataset of a single feature-less graph
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

When the dataset contains only one graph with no node or edge features, only three files are
needed in the data folder: ``meta.yaml``, one CSV for node IDs and one CSV for edges:

.. code::

    ./mini_featureless_dataset/
    |-- meta.yaml
    |-- nodes.csv
    |-- edges.csv

``meta.yaml`` contains the following information:

.. code:: yaml

    dataset_name: mini_featureless_dataset
    edge_data:
    - file_name: edges.csv
    node_data:
    - file_name: nodes.csv

``nodes.csv`` lists the node IDs under the ``node_id`` field:

.. code::

    node_id
    0
    1
    2
    3
    4

``edges.csv`` lists all the edges in two columns (``src_id`` and ``dst_id``) specifying the
source and destination node ID of each edge:

.. code::

    src_id,dst_id
    4,4
    4,1
    3,0
    4,1
    4,0
    1,2
    1,3
    3,3
    1,1
    4,1

After loading, the dataset has one graph without any features:

.. code:: python

    >>> import dgl
    >>> dataset = dgl.data.CSVDataset('./mini_featureless_dataset')
    >>> g = dataset[0]  # only one graph
    >>> print(g)
    Graph(num_nodes=5, num_edges=10,
          ndata_schemes={}
          edata_schemes={})

.. note::
    Non-integer node IDs are allowed. When constructing the graph, ``CSVDataset`` will
    map each raw ID to an integer ID starting from zero.
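
The relabeling described in this note can be pictured with the sketch below. It is a conceptual
illustration only: the helper ``relabel`` is hypothetical, and numbering IDs in order of first
appearance is an assumption of this sketch, not a documented guarantee of ``CSVDataset``.

.. code:: python

    def relabel(raw_ids):
        """Map each distinct raw ID to a consecutive integer, starting at 0."""
        mapping = {}
        for raw in raw_ids:
            # setdefault assigns the next free integer only on first sight.
            mapping.setdefault(raw, len(mapping))
        return mapping

    node_ids = ["alice", "bob", "carol"]
    mapping = relabel(node_ids)
    edges = [("alice", "bob"), ("carol", "alice")]
    int_edges = [(mapping[s], mapping[d]) for s, d in edges]
    print(mapping)    # {'alice': 0, 'bob': 1, 'carol': 2}
    print(int_edges)  # [(0, 1), (2, 0)]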

.. note::
    Edges are always directed. To have both directions, add reversed edges in the edge
    CSV file or use :class:`~dgl.transform.AddReverse` to transform the loaded graph.
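
For the first option, a minimal standard-library sketch that appends the reverse of every edge
to the rows of an edge CSV might look as follows. The helper name ``add_reverse_rows`` is
hypothetical. Skipping self-loops is a design choice of this sketch, since a self-loop is
already its own reverse.

.. code:: python

    import csv
    import io

    def add_reverse_rows(csv_text: str) -> str:
        """Return edge CSV text with a reversed copy of each non-self-loop edge."""
        reader = csv.DictReader(io.StringIO(csv_text))
        rows = list(reader)
        reversed_rows = [
            {**row, "src_id": row["dst_id"], "dst_id": row["src_id"]}
            for row in rows
            if row["src_id"] != row["dst_id"]  # a self-loop is its own reverse
        ]
        out = io.StringIO()
        writer = csv.DictWriter(out, fieldnames=reader.fieldnames, lineterminator="\n")
        writer.writeheader()
        writer.writerows(rows + reversed_rows)
        return out.getvalue()

    result = add_reverse_rows("src_id,dst_id\n4,1\n3,3\n")
    print(result)

Any additional columns in a row (features, labels) are copied unchanged to the reversed edge.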


A graph without any features is usually of limited interest. The next example shows
how to load and parse node and edge features.

Dataset of a single graph with features and labels
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

When the dataset contains a single graph with node or edge features and labels, only three
files are still needed in the data folder: ``meta.yaml``, one CSV for nodes and one CSV
for edges:

.. code::

    ./mini_feature_dataset/
    |-- meta.yaml
    |-- nodes.csv
    |-- edges.csv

``meta.yaml``:

.. code:: yaml

    dataset_name: mini_feature_dataset
    edge_data:
    - file_name: edges.csv
    node_data:
    - file_name: nodes.csv

``edges.csv`` with five synthetic edge data columns (``label``, ``train_mask``, ``val_mask``, ``test_mask``, ``feat``):

.. code::

    src_id,dst_id,label,train_mask,val_mask,test_mask,feat
    4,0,2,False,True,True,"0.5477868606453535, 0.4470617033458436, 0.936706701616337"
    4,0,0,False,False,True,"0.9794634290792008, 0.23682038840665198, 0.049629338970987646"
    0,3,1,True,True,True,"0.8586722047523594, 0.5746912787380253, 0.6462162561249654"
    0,1,2,True,False,False,"0.2730008213674695, 0.5937484188166621, 0.765544096939567"
    0,2,1,True,True,True,"0.45441619816038514, 0.1681403185591509, 0.9952376085297715"
    0,0,0,False,False,False,"0.4197669213305396, 0.849983324532477, 0.16974127573016262"
    2,2,1,False,True,True,"0.5495035052928215, 0.21394654203489705, 0.7174910641836348"
    1,0,2,False,True,False,"0.008790817766266334, 0.4216530595907526, 0.529195480661293"
    3,0,0,True,True,True,"0.6598715708878852, 0.1932390907048961, 0.9774471538377553"
    4,0,1,False,False,False,"0.16846068931179736, 0.41516080644186737, 0.002158116134429955"


``nodes.csv`` with five synthetic node data columns (``label``, ``train_mask``, ``val_mask``, ``test_mask``, ``feat``):

.. code::

    node_id,label,train_mask,val_mask,test_mask,feat
    0,1,False,True,True,"0.07816474278491703, 0.9137336384979067, 0.4654086994009452"
    1,1,True,True,True,"0.05354099924658973, 0.8753101998792645, 0.33929432608774135"
    2,1,True,False,True,"0.33234211884156384, 0.9370522452510665, 0.6694943496824788"
    3,0,False,True,False,"0.9784264442230887, 0.22131880861864428, 0.3161154827254189"
    4,1,True,True,False,"0.23142237259162102, 0.8715767748481147, 0.19117861103555467"

After loading, the dataset has one graph. Node/edge features are stored in ``ndata`` and ``edata``
under the same names as the CSV columns. The example also demonstrates how to specify a
vector-shaped feature: as a comma-separated list enclosed in double quotes ``"..."``.

.. code:: python

    >>> import dgl
    >>> dataset = dgl.data.CSVDataset('./mini_feature_dataset')
    >>> g = dataset[0]  # only one graph
    >>> print(g)
    Graph(num_nodes=5, num_edges=10,
          ndata_schemes={'label': Scheme(shape=(), dtype=torch.int64), 'train_mask': Scheme(shape=(), dtype=torch.bool), 'val_mask': Scheme(shape=(), dtype=torch.bool), 'test_mask': Scheme(shape=(), dtype=torch.bool), 'feat': Scheme(shape=(3,), dtype=torch.float64)}
          edata_schemes={'label': Scheme(shape=(), dtype=torch.int64), 'train_mask': Scheme(shape=(), dtype=torch.bool), 'val_mask': Scheme(shape=(), dtype=torch.bool), 'test_mask': Scheme(shape=(), dtype=torch.bool), 'feat': Scheme(shape=(3,), dtype=torch.float64)})
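
To see why the double quotes matter for vector-shaped features, the sketch below reads one row
with Python's standard ``csv`` module. It only illustrates CSV quoting; how ``CSVDataset``
splits the field internally is not shown here, and the sample row is made up.

.. code:: python

    import csv
    import io

    row_text = 'node_id,label,feat\n0,1,"0.1, 0.2, 0.3"\n'
    row = next(csv.DictReader(io.StringIO(row_text)))
    # The quotes keep the embedded commas inside a single CSV field.
    feat = [float(x) for x in row["feat"].split(",")]
    print(row["feat"])  # 0.1, 0.2, 0.3
    print(feat)         # [0.1, 0.2, 0.3]

Without the quotes, the same row would be split into five fields instead of three.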

.. note::
    By default, ``CSVDataset`` assumes all feature data to be numerical values (e.g., int, float, bool or
    list) and does not allow missing values. Users can provide a custom data parser for these cases.
    See `Custom Data Parser`_ for more details.

Dataset of a single heterogeneous graph
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

One can specify multiple node and edge CSV files (one per type) to represent a heterogeneous graph.
Here is an example dataset with two node types and two edge types:

.. code::

    ./mini_hetero_dataset/
    |-- meta.yaml
    |-- nodes_0.csv
    |-- nodes_1.csv
    |-- edges_0.csv
    |-- edges_1.csv

The ``meta.yaml`` specifies the node type name (using ``ntype``) and edge type name (using ``etype``)
of each CSV file. The edge type name is a string triplet containing the source node type name, relation
name and the destination node type name.

.. code:: yaml

    dataset_name: mini_hetero_dataset
    edge_data:
    - file_name: edges_0.csv
      etype: [user, follow, user]
    - file_name: edges_1.csv
      etype: [user, like, item]
    node_data:
    - file_name: nodes_0.csv
      ntype: user
    - file_name: nodes_1.csv
      ntype: item
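
The relationship between ``ntype`` and the ``etype`` triplets can be sketched with plain Python
dictionaries standing in for the parsed ``meta.yaml``. The check that both endpoint types are
declared in ``node_data`` is a sanity check of this sketch, not a description of how
``CSVDataset`` validates its input.

.. code:: python

    node_data = [
        {"file_name": "nodes_0.csv", "ntype": "user"},
        {"file_name": "nodes_1.csv", "ntype": "item"},
    ]
    edge_data = [
        {"file_name": "edges_0.csv", "etype": ["user", "follow", "user"]},
        {"file_name": "edges_1.csv", "etype": ["user", "like", "item"]},
    ]

    declared = {entry["ntype"] for entry in node_data}
    triplets = []
    for entry in edge_data:
        # Unpack the triplet: source node type, relation, destination node type.
        src_type, relation, dst_type = entry["etype"]
        if src_type not in declared or dst_type not in declared:
            raise ValueError(f"{entry['file_name']}: undeclared node type in {entry['etype']}")
        triplets.append((src_type, relation, dst_type))

    print(triplets)
    # [('user', 'follow', 'user'), ('user', 'like', 'item')]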

The node and edge CSV files follow the same format as in homogeneous graphs. Here are some synthetic
data for demonstration purposes:

``edges_0.csv`` and ``edges_1.csv``:

.. code::

    src_id,dst_id,label,feat
    4,4,1,"0.736833152378035,0.10522806046048205,0.9418796835016118"
    3,4,2,"0.5749339182767451,0.20181320245665535,0.490938012147181"
    1,4,2,"0.7697294432580938,0.49397782380750765,0.10864079337442234"
    0,4,0,"0.1364240150959487,0.1393107840629273,0.7901988878812207"
    2,3,1,"0.42988138237505735,0.18389137408509248,0.18431292077750894"
    0,4,2,"0.8613368738351794,0.67985810014162,0.6580438064356824"
    2,4,1,"0.6594951663841697,0.26499036865016423,0.7891429392727503"
    4,1,0,"0.36649684241348557,0.9511783938523962,0.8494919263589972"
    1,1,2,"0.698592283371875,0.038622249776255946,0.5563827995742111"
    0,4,1,"0.5227112950269823,0.3148264185956532,0.47562693094002173"

``nodes_0.csv`` and ``nodes_1.csv``:

.. code::

    node_id,label,feat
    0,2,"0.5400687466285844,0.7588441197954202,0.4268254673041745"
    1,1,"0.08680051341900807,0.11446843700743892,0.7196969604886617"
    2,2,"0.8964389655603473,0.23368113896545695,0.8813472954005022"
    3,1,"0.5454703921677284,0.7819383771535038,0.3027939452162367"
    4,1,"0.5365210052235699,0.8975240205792763,0.7613943085507672"

After loading, the dataset has one heterograph with features and labels:

.. code:: python

    >>> import dgl
    >>> dataset = dgl.data.CSVDataset('./mini_hetero_dataset')
    >>> g = dataset[0]  # only one graph
    >>> print(g)
    Graph(num_nodes={'item': 5, 'user': 5},
          num_edges={('user', 'follow', 'user'): 10, ('user', 'like', 'item'): 10},
          metagraph=[('user', 'user', 'follow'), ('user', 'item', 'like')])
    >>> g.nodes['user'].data
    {'label': tensor([2, 1, 2, 1, 1]), 'feat': tensor([[0.5401, 0.7588, 0.4268],
            [0.0868, 0.1145, 0.7197],
            [0.8964, 0.2337, 0.8813],
            [0.5455, 0.7819, 0.3028],
            [0.5365, 0.8975, 0.7614]], dtype=torch.float64)}
    >>> g.edges['like'].data
    {'label': tensor([1, 2, 2, 0, 1, 2, 1, 0, 2, 1]), 'feat': tensor([[0.7368, 0.1052, 0.9419],
            [0.5749, 0.2018, 0.4909],
            [0.7697, 0.4940, 0.1086],
            [0.1364, 0.1393, 0.7902],
            [0.4299, 0.1839, 0.1843],
            [0.8613, 0.6799, 0.6580],
            [0.6595, 0.2650, 0.7891],
            [0.3665, 0.9512, 0.8495],
            [0.6986, 0.0386, 0.5564],
            [0.5227, 0.3148, 0.4756]], dtype=torch.float64)}

Dataset of multiple graphs
~~~~~~~~~~~~~~~~~~~~~~~~~~

When there are multiple graphs, one can include an additional CSV file for storing graph-level features.
Here is an example:

.. code::

    ./mini_multi_dataset/
    |-- meta.yaml
    |-- nodes.csv
    |-- edges.csv
    |-- graphs.csv

Accordingly, the ``meta.yaml`` should include an extra ``graph_data`` key to tell which CSV file to
load graph-level features from.

.. code:: yaml

    dataset_name: mini_multi_dataset
    edge_data:
    - file_name: edges.csv
    node_data:
    - file_name: nodes.csv
    graph_data:
      file_name: graphs.csv

To distinguish nodes and edges of different graphs, ``nodes.csv`` and ``edges.csv`` must contain
an extra column ``graph_id``:

``edges.csv``:

.. code::

    graph_id,src_id,dst_id,feat
    0,0,4,"0.39534097273254654,0.9422093637539785,0.634899790318452"
    0,3,0,"0.04486384200747007,0.6453746567017163,0.8757520744192612"
    0,3,2,"0.9397636966928355,0.6526403892728874,0.8643238446466464"
    0,1,1,"0.40559906615287566,0.9848072295736628,0.493888090726854"
    0,4,1,"0.253458867276219,0.9168191778828504,0.47224962583565544"
    0,0,1,"0.3219496197945605,0.3439899477636117,0.7051530741717352"
    0,2,1,"0.692873149428549,0.4770019763881086,0.21937428942781778"
    0,4,0,"0.620118223673067,0.08691420300562658,0.86573472329756"
    0,2,1,"0.00743445923710373,0.5251800239734318,0.054016385555202384"
    0,4,1,"0.6776417760682221,0.7291568018841328,0.4523600060547709"
    1,1,3,"0.6375445528248924,0.04878384701995819,0.4081642382536248"
    1,0,4,"0.776002616178397,0.8851294998284638,0.7321742043493028"
    1,1,0,"0.0928555079874982,0.6156748364694707,0.6985674921582508"
    1,0,2,"0.31328748118329997,0.8326121496142408,0.04133991340612775"
    1,1,0,"0.36786902637778773,0.39161865931662243,0.9971749359397111"
    1,1,1,"0.4647410679872376,0.8478810655406659,0.6746269314422184"
    1,0,2,"0.8117650553546695,0.7893727601272978,0.41527155506593394"
    1,1,3,"0.40707309111756307,0.2796588354307046,0.34846782265758314"
    1,1,0,"0.18626464175355095,0.3523777809254057,0.7863421810531344"
    1,3,0,"0.28357022069634585,0.13774964202156292,0.5913335505943637"

``nodes.csv``:

.. code::

    graph_id,node_id,feat
    0,0,"0.5725330322207948,0.8451870383322376,0.44412796119211184"
    0,1,"0.6624186423087752,0.6118386331195641,0.7352138669985214"
    0,2,"0.7583372765843964,0.15218126307872892,0.6810484348765842"
    0,3,"0.14627522432017592,0.7457985352827006,0.1037097085190507"
    0,4,"0.49037522512771525,0.8778998699783784,0.0911194482288028"
    1,0,"0.11158102039672668,0.08543289788089736,0.6901745368284345"
    1,1,"0.28367647637469273,0.07502571020414439,0.01217200152200748"
    1,2,"0.2472495901894738,0.24285506608575758,0.6494437360242048"
    1,3,"0.5614197853127827,0.059172654879085296,0.4692371689047904"
    1,4,"0.17583413999295983,0.5191278830882644,0.8453123358491914"
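
The role of the ``graph_id`` column can be sketched by grouping edge rows per graph with the
standard library. This is a conceptual illustration with a shortened, made-up edge list, not
the code ``CSVDataset`` runs.

.. code:: python

    import csv
    import io
    from collections import defaultdict

    edges_csv = "graph_id,src_id,dst_id\n0,0,4\n0,3,0\n1,1,3\n1,0,4\n1,1,0\n"

    edges_per_graph = defaultdict(list)
    for row in csv.DictReader(io.StringIO(edges_csv)):
        # Rows sharing a graph_id belong to the same graph.
        edges_per_graph[int(row["graph_id"])].append(
            (int(row["src_id"]), int(row["dst_id"]))
        )

    print(dict(edges_per_graph))
    # {0: [(0, 4), (3, 0)], 1: [(1, 3), (0, 4), (1, 0)]}

Node IDs are local to each graph here, so node ``0`` of graph ``0`` and node ``0`` of graph ``1``
are different nodes.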

``graphs.csv`` contains a ``graph_id`` column and an arbitrary number of feature columns.
The example dataset here has two graphs, each with two graph-level data columns, ``feat``
and ``label``.

.. code::

    graph_id,feat,label
    0,"0.7426272601929126,0.5197462471155317,0.8149104951283953",0
    1,"0.534822233529295,0.2863627767733977,0.1154897249106891",0

After loading, the dataset has multiple homogeneous graphs with features and labels:

.. code:: python

    >>> import dgl
    >>> dataset = dgl.data.CSVDataset('./mini_multi_dataset')
    >>> print(len(dataset))
    2
    >>> graph0, data0 = dataset[0]
    >>> print(graph0)
    Graph(num_nodes=5, num_edges=10,
          ndata_schemes={'feat': Scheme(shape=(3,), dtype=torch.float64)}
          edata_schemes={'feat': Scheme(shape=(3,), dtype=torch.float64)})
    >>> print(data0)
    {'feat': tensor([0.7426, 0.5197, 0.8149]), 'label': tensor([0])}
    >>> graph1, data1 = dataset[1]
    >>> print(graph1)
    Graph(num_nodes=5, num_edges=10,
          ndata_schemes={'feat': Scheme(shape=(3,), dtype=torch.float64)}
          edata_schemes={'feat': Scheme(shape=(3,), dtype=torch.float64)})
    >>> print(data1)
    {'feat': tensor([0.5348, 0.2864, 0.1155]), 'label': tensor([0])}

.. note::

    When there are multiple graphs, ``CSVDataset`` currently requires them to be homogeneous.


Custom Data Parser
~~~~~~~~~~~~~~~~~~

By default, ``CSVDataset`` assumes that all the stored node-, edge- and graph-level data are numerical
values. Users can provide a custom ``DataParser`` to ``CSVDataset`` to handle more complex
data types. A ``DataParser`` needs to implement the ``__call__`` method, which takes the
:class:`pandas.DataFrame` object created from the CSV file and returns a dictionary of
parsed feature data. The parsed feature data will be saved to the ``ndata`` and ``edata`` of
the corresponding ``DGLGraph`` object, and thus must be tensors or NumPy arrays. The example below
shows a ``DataParser`` that converts string labels to integers.

Given a dataset as follows,

.. code::

    ./customized_parser_dataset/
    |-- meta.yaml
    |-- nodes.csv
    |-- edges.csv

``meta.yaml``:

.. code:: yaml

    dataset_name: customized_parser_dataset
    edge_data:
    - file_name: edges.csv
    node_data:
    - file_name: nodes.csv

``edges.csv``:

.. code::

    src_id,dst_id,label
    4,0,positive
    4,0,negative
    0,3,positive
    0,1,positive
    0,2,negative
    0,0,positive
    2,2,negative
    1,0,positive
    3,0,negative
    4,0,positive

``nodes.csv``:

.. code::

    node_id,label
    0,positive
    1,negative
    2,positive
    3,negative
    4,positive

To parse the string type labels, one can define a ``DataParser`` class as follows:

.. code:: python

    import numpy as np
    import pandas as pd

    class MyDataParser:
        def __call__(self, df: pd.DataFrame):
            parsed = {}
            for header in df:
                if 'Unnamed' in header:  # Handle Unnamed column
                    print("Unnamed column is found. Ignored...")
                    continue
                dt = df[header].to_numpy().squeeze()
                if header == 'label':
                    dt = np.array([1 if e == 'positive' else 0 for e in dt])
                parsed[header] = dt
            return parsed

Create a ``CSVDataset`` using the defined ``DataParser``:

.. code:: python

    >>> dataset = dgl.data.CSVDataset('./customized_parser_dataset',
    ...                               ndata_parser=MyDataParser(),
    ...                               edata_parser=MyDataParser())
    >>> print(dataset[0].ndata['label'])
    tensor([1, 0, 1, 0, 1])
    >>> print(dataset[0].edata['label'])
    tensor([1, 0, 1, 1, 0, 1, 0, 1, 0, 1])

.. note::

    To specify different ``DataParser``\ s for different node/edge types, pass a dictionary to
    ``ndata_parser`` and ``edata_parser``, where the key is type name (a single string for
    node type; a string triplet for edge type) and the value is the ``DataParser`` to use.
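
Missing values are another case the default parser does not accept. A parser along the
following lines could fill them in before the data reaches the graph. It is a sketch under
stated assumptions: ``FillMissingParser`` and its ``fill_value`` argument are hypothetical
names, only scalar numeric columns are handled, and the demo frame holds feature columns only.

.. code:: python

    import numpy as np
    import pandas as pd

    class FillMissingParser:
        """Hypothetical parser: replace missing numeric entries with a constant."""

        def __init__(self, fill_value=0.0):
            self.fill_value = fill_value

        def __call__(self, df: pd.DataFrame):
            parsed = {}
            for header in df:
                if 'Unnamed' in header:  # skip index-like columns without a name
                    continue
                # Non-numeric or empty cells become NaN, then get the fill value.
                column = pd.to_numeric(df[header], errors='coerce')
                parsed[header] = column.fillna(self.fill_value).to_numpy()
            return parsed

    df = pd.DataFrame({'score': [0.5, None, 1.5], 'label': [1, 0, 1]})
    parsed = FillMissingParser()(df)
    print(parsed['score'].tolist())  # [0.5, 0.0, 1.5]
    print(parsed['label'].tolist())  # [1, 0, 1]

Such a parser would be passed as ``ndata_parser`` or ``edata_parser`` in the same way as
``MyDataParser`` above.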


Full YAML Specification
~~~~~~~~~~~~~~~~~~~~~~~

``CSVDataset`` allows more flexible control over the loading and parsing process. For example, one
can change the ID column names via ``meta.yaml``. The example below lists all the supported keys.

.. code:: yaml

    version: 1.0.0
    dataset_name: some_complex_data
    separator: ','                   # CSV separator symbol. Default: ','
    edge_data:
    - file_name: edges_0.csv
      etype: [user, follow, user]
      src_id_field: src_id           # Column name for source node IDs. Default: src_id
      dst_id_field: dst_id           # Column name for destination node IDs. Default: dst_id
    - file_name: edges_1.csv
      etype: [user, like, item]
      src_id_field: src_id
      dst_id_field: dst_id
    node_data:
    - file_name: nodes_0.csv
      ntype: user
      node_id_field: node_id         # Column name for node IDs. Default: node_id
    - file_name: nodes_1.csv
      ntype: item
      node_id_field: node_id         # Column name for node IDs. Default: node_id
    graph_data:
      file_name: graphs.csv
      graph_id_field: graph_id       # Column name for graph IDs. Default: graph_id

Top-level
^^^^^^^^^^^^^^

At the top level, only 6 keys are available:

  - ``version``: Optional. String.
    It specifies which version of ``meta.yaml`` is used. More features may be added in the future.
  - ``dataset_name``: Required. String.
    It specifies the dataset name.
  - ``separator``: Optional. String.
    It specifies how to parse data in CSV files. Default: ``','``.
  - ``edge_data``: Required. List of ``EdgeData``.
    Meta data for parsing edge CSV files.
  - ``node_data``: Required. List of ``NodeData``.
    Meta data for parsing node CSV files.
  - ``graph_data``: Optional. ``GraphData``.
    Meta data for parsing the graph CSV file.

``EdgeData``
^^^^^^^^^^^^^^^^^^^^^^

There are 4 keys:

  - ``file_name``: Required. String.
    The CSV file to load data from.
  - ``etype``: Optional. List of string.
    Edge type name in string triplet: [source node type, relation type, destination node type].
  - ``src_id_field``: Optional. String.
    Which column to read for source node IDs. Default: ``src_id``.
  - ``dst_id_field``: Optional. String.
    Which column to read for destination node IDs. Default: ``dst_id``.
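
To make the ``src_id_field``, ``dst_id_field`` and ``separator`` keys concrete, the sketch below
reads an edge file whose columns are named differently, with a plain dictionary standing in for
one parsed ``EdgeData`` entry. The column names ``from`` and ``to``, the ``;`` separator and the
helper ``read_edges`` are made up for illustration.

.. code:: python

    import csv
    import io

    def read_edges(csv_text, entry, separator=","):
        """Return (src, dst) pairs using the column names given in entry."""
        src_field = entry.get("src_id_field", "src_id")  # documented default
        dst_field = entry.get("dst_id_field", "dst_id")  # documented default
        reader = csv.DictReader(io.StringIO(csv_text), delimiter=separator)
        return [(row[src_field], row[dst_field]) for row in reader]

    entry = {"file_name": "edges_0.csv", "src_id_field": "from", "dst_id_field": "to"}
    pairs = read_edges("from;to\n4;1\n3;0\n", entry, separator=";")
    print(pairs)
    # [('4', '1'), ('3', '0')]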

``NodeData``
^^^^^^^^^^^^^^^^^^^^^^

There are 3 keys:

  - ``file_name``: Required. String.
    The CSV file to load data from.
  - ``ntype``: Optional. String.
    Node type name.
  - ``node_id_field``: Optional. String.
    Which column to read for node IDs. Default: ``node_id``.

``GraphData``
^^^^^^^^^^^^^^^^^^^^^^

There are 2 keys:

  - ``file_name``: Required. String.
    The CSV file to load data from.
  - ``graph_id_field``: Optional. String.
    Which column to read for graph IDs. Default: ``graph_id``.
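
As a rough consistency check over the keys listed above, the sketch below inspects an
already-parsed ``meta.yaml`` (represented here as a plain dictionary) for the required entries.
The helper ``check_meta`` is hypothetical and is not how ``CSVDataset`` validates its input.

.. code:: python

    def check_meta(meta: dict) -> list:
        """Return a list of problems found in a parsed meta.yaml dictionary."""
        problems = []
        for key in ("dataset_name", "edge_data", "node_data"):
            if key not in meta:
                problems.append(f"missing required key: {key}")
        for section in ("edge_data", "node_data"):
            for i, entry in enumerate(meta.get(section, [])):
                if "file_name" not in entry:
                    problems.append(f"{section}[{i}] has no file_name")
        graph_data = meta.get("graph_data")  # optional, a single mapping
        if graph_data is not None and "file_name" not in graph_data:
            problems.append("graph_data has no file_name")
        return problems

    meta = {
        "dataset_name": "some_complex_data",
        "edge_data": [{"file_name": "edges_0.csv", "etype": ["user", "follow", "user"]}],
        "node_data": [{"ntype": "user"}],  # file_name deliberately omitted
    }
    problems = check_meta(meta)
    print(problems)  # ['node_data[0] has no file_name']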