OpenDAS / dgl · Commits
Commit 8d861d94
authored Feb 11, 2022 by RhettYing

refine, especially for data parser

parent e0f054fb
Showing 1 changed file with 25 additions and 20 deletions

docs/source/guide/data-loadcsv.rst  +25 −20
@@ -26,7 +26,7 @@ To create a DGLCSVDataset object:
    import dgl

    ds = dgl.data.DGLCSVDataset('/path/to/dataset')
The returned ``ds`` object is a standard :class:`~dgl.data.DGLDataset`. For example, if the
dataset is for single-graph node classification, you can use it as:
.. code:: python
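(The body of this code block is collapsed in the diff. As a rough sketch of such usage — the ``feat``, ``label`` and ``train_mask`` column names below are illustrative assumptions, not taken from this hunk:)

.. code:: python

    import dgl

    ds = dgl.data.DGLCSVDataset('/path/to/dataset')
    g = ds[0]                            # the single graph in the dataset
    feat = g.ndata['feat']               # node features parsed from nodes.csv
    label = g.ndata['label']             # node labels
    train_mask = g.ndata['train_mask']   # boolean mask for training nodes
    # train a node classifier with feat[train_mask] and label[train_mask]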
@@ -42,9 +42,9 @@ Data folder structure
    /path/to/dataset/
    |-- meta.yaml     # metadata of the dataset
    |-- edges_0.csv   # edge data including src_id, dst_id, feature, label and so on
    |-- ...           # you can have as many CSVs for edge data as you want
    |-- nodes_0.csv   # node data including node_id, feature, label and so on
    |-- ...           # you can have as many CSVs for node data as you want
    |-- graphs.csv    # graph-level features
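(A ``meta.yaml`` tying these files together might look roughly like the sketch below; the dataset name is illustrative and the keys follow the ones documented later in this guide:)

.. code:: yaml

    dataset_name: example_dataset
    edge_data:
    - file_name: edges_0.csv
    node_data:
    - file_name: nodes_0.csv
    graph_data:
      file_name: graphs.csv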
@@ -60,7 +60,7 @@ Dataset of a single feature-less graph
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
When the dataset contains only one graph with no node or edge features, only three files are
needed in the data folder: ``meta.yaml``, one CSV for node IDs and one CSV for edges:
.. code::
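(The listing under this block is collapsed. A self-contained sketch that builds and loads such a minimal dataset — the path, file contents and ``meta.yaml`` keys are illustrative assumptions:)

.. code:: python

    import os

    import dgl

    root = '/tmp/minimal_dataset'
    os.makedirs(root, exist_ok=True)
    # meta.yaml declares which CSVs hold the nodes and the edges
    with open(os.path.join(root, 'meta.yaml'), 'w') as f:
        f.write('dataset_name: minimal_dataset\n'
                'edge_data:\n- file_name: edges.csv\n'
                'node_data:\n- file_name: nodes.csv\n')
    # nodes.csv: node IDs only, no features
    with open(os.path.join(root, 'nodes.csv'), 'w') as f:
        f.write('node_id\n0\n1\n2\n')
    # edges.csv: source and destination IDs only
    with open(os.path.join(root, 'edges.csv'), 'w') as f:
        f.write('src_id,dst_id\n0,1\n1,2\n')

    ds = dgl.data.DGLCSVDataset(root)
    g = ds[0]  # one featureless graph: 3 nodes, 2 edges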
@@ -191,11 +191,10 @@ After loaded, the dataset has one graph with features and labels:
    # edata_schemes={'label': Scheme(shape=(), dtype=torch.int64), 'train_mask': Scheme(shape=(), dtype=torch.bool), 'val_mask': Scheme(shape=(), dtype=torch.bool), 'test_mask': Scheme(shape=(), dtype=torch.bool), 'feat': Scheme(shape=(3,), dtype=torch.float64)})
.. note::

    All columns will be read, parsed and set as edge/node attributes, except ``node_id`` in
    ``nodes.csv`` and ``src_id``/``dst_id`` in ``edges.csv``. Users can access them directly,
    e.g. ``g.ndata['label']``. The keys in ``g.ndata`` and ``g.edata`` are the same as the
    original column names, and the data format is inferred automatically during parsing.
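For instance (an illustration with made-up values, not this dataset's contents), a ``feat`` cell holding a comma-separated string of floats comes back as a numeric tensor:

.. code:: python

    # nodes.csv (illustrative):
    #   node_id,label,feat
    #   0,1,"0.1,0.2,0.3"
    #   1,0,"0.4,0.5,0.6"
    import dgl

    ds = dgl.data.DGLCSVDataset('/path/to/dataset')
    g = ds[0]
    print(g.ndata['label'])       # tensor([1, 0])
    print(g.ndata['feat'].shape)  # torch.Size([2, 3]); each string cell was
                                  # split on ',' and cast to floats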
Dataset of a single heterogeneous graph
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
@@ -215,7 +214,9 @@ only 5 files in the data folder: ``meta.yaml``, 2 CSV for nodes and 2 CSV for ed
``meta.yaml``
For a heterogeneous graph, ``etype`` and ``ntype`` are MUST-HAVE and UNIQUE in ``edge_data`` and
``node_data`` respectively; otherwise only the last etype/ntype is kept when generating the graph,
as all of them use the same default etype/ntype name. What's more, each node/edge CSV file should
contain a single, unique ntype/etype. If there are several ntypes/etypes, multiple node/edge CSV
files are required.
.. code:: yaml
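(The YAML body of this block is collapsed in the diff. A sketch honoring the rule above — the ``user``/``item`` type names and file names are illustrative assumptions:)

.. code:: yaml

    dataset_name: hetero_dataset
    edge_data:
    - file_name: edges_0.csv
      etype: [user, follow, user]   # one unique canonical edge type per CSV
    - file_name: edges_1.csv
      etype: [user, like, item]
    node_data:
    - file_name: nodes_0.csv
      ntype: user                   # one unique node type per CSV
    - file_name: nodes_1.csv
      ntype: item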
@@ -495,12 +496,13 @@ Optional. String. It specifies which column to be read for graph ids. Default va
Parse node/edge/graph data on your own
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
By default, all data are attached to ``g.ndata`` with the same keys as the column names in
``nodes.csv``, except ``node_id``; the same applies to data in ``edges.csv``. Data is
auto-formatted via ``pandas`` unless it is a string of float values (feature data is often in this
format). For more control, users can define their own node/edge/graph data parser: a callable that
accepts a ``pandas.DataFrame`` as input. Pass such a callable while instantiating
``DGLCSVDataset``. Below is an example.
``SelfDefinedDataParser``:
.. code:: python
@@ -518,10 +520,8 @@ and passed such callable instance while instantiating ``DGLCSVDataset``. Below i
print("Unamed column is found. Ignored...")
print("Unamed column is found. Ignored...")
continue
continue
dt = df[header].to_numpy().squeeze()
dt = df[header].to_numpy().squeeze()
print("{},{}".format(header, dt))
if header == 'label':
if header == 'label':
dt = np.array([1 if e == 'positive' else 0 for e in dt])
dt = np.array([1 if e == 'positive' else 0 for e in dt])
print("{},{}".format(header, dt))
data[header] = dt
data[header] = dt
return data
return data
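(Only a fragment of the parser appears in this hunk. Filled out around it, a complete ``SelfDefinedDataParser`` would look roughly like the sketch below; the class skeleton outside the fragment is inferred, not quoted from the file:)

.. code:: python

    import numpy as np
    import pandas as pd

    class SelfDefinedDataParser:
        """Example parser: maps 'positive'/'negative' label strings to 1/0."""
        def __call__(self, df: pd.DataFrame) -> dict:
            data = {}
            for header in df:
                # pandas names index-like columns 'Unnamed: N'; skip them
                if 'Unnamed' in header:
                    print("Unnamed column is found. Ignored...")
                    continue
                dt = df[header].to_numpy().squeeze()
                if header == 'label':
                    dt = np.array([1 if e == 'positive' else 0 for e in dt])
                data[header] = dt
            return data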
@@ -578,7 +578,9 @@ After loaded, the dataset has one graph with features and labels:
.. code:: python

    import dgl
    dataset = dgl.data.DGLCSVDataset('./customized_parser_dataset',
                                     node_data_parser={'_V': SelfDefinedDataParser()},
                                     edge_data_parser={('_V','_E','_V'): SelfDefinedDataParser()})
    print(dataset[0].ndata['label'])
    # tensor([1, 0, 1, 0, 1])
    print(dataset[0].edata['label'])
@@ -590,9 +592,12 @@ FAQs:
What's the data type in CSV files?
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
By default, a data parser is used for node/edge/graph CSV files which infers data types
automatically. ID-related data such as ``node_id``, ``src_id``, ``dst_id`` and ``graph_id`` are
required to be numeric, as these fields are used for constructing the graph. Any other data will
be attached to ``g.ndata`` or ``g.edata`` directly, so it is the user's responsibility to make
sure the data types are as expected when used within the graph. In particular, ``string`` data
composed of ``float`` values is split and cast into a float array by the default data parser.
What if some lines in CSV have missing values in several fields?
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^