Commit bc2fef9c authored by Mufei Li, committed by GitHub

[Doc] Misc Fix for User Guide 7.1 Data Preprocessing (#4433)



* Update

* rollback for partition_algo/random.py
Co-authored-by: Ubuntu <ubuntu@ip-172-31-20-21.us-west-2.compute.internal>
parent d248e768
@@ -20,7 +20,7 @@ training. For example,
  import dgl
- g = ... # create or load an DGLGraph object
+ g = ... # create or load a DGLGraph object
  dgl.distributed.partition_graph(g, 'mygraph', 2, 'data_root_dir')
  will outputs the following data file.
@@ -243,7 +243,7 @@ strict requirement as long as ``metadata.json`` contains valid file paths.
  in each chunk.
  * ``edge_type``: List of string. Edge type names in the form of
  ``<source node type>:<relation>:<destination node type>``.
  * ``num_edges_per_chunk``: List of list of integer. For graphs with :math:`R` edge
  types stored in :math:`P` chunks, the value contains :math:`R` integer lists.
  Each list contains :math:`P` integers, which specify the number of edges
  in each chunk.
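
As a rough illustration of the ``num_edges_per_chunk`` layout described in this hunk, the check below verifies that a metadata dictionary holds one list per edge type and :math:`P` integers per list. The function name ``check_edge_chunks`` and the sample values are hypothetical and not part of DGL; only the two key names come from the documentation.

```python
def check_edge_chunks(meta, num_chunks):
    """Return True if num_edges_per_chunk holds R lists of P non-negative ints.

    R is taken from the length of meta["edge_type"]; P is num_chunks.
    """
    edge_types = meta["edge_type"]
    counts = meta["num_edges_per_chunk"]
    # One list of counts per edge type (R lists in total).
    if len(counts) != len(edge_types):
        return False
    for per_type in counts:
        # Each list must have exactly one entry per chunk (P entries).
        if len(per_type) != num_chunks:
            return False
        if not all(isinstance(n, int) and n >= 0 for n in per_type):
            return False
    return True


# Hypothetical sample: two edge types stored in two chunks each.
sample = {
    "edge_type": ["author:writes:paper", "paper:cites:paper"],
    "num_edges_per_chunk": [[100, 120], [300, 280]],
}
print(check_edge_chunks(sample, num_chunks=2))
```
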
@@ -262,8 +262,7 @@ strict requirement as long as ``metadata.json`` contains valid file paths.
  details about how to parse each data file.
  - ``"csv"``: CSV file. Use the ``delimiter`` key to specify delimiter in use.
  - ``"numpy"``: NumPy array binary file created by :func:`numpy.save`.
- * ``data``: List of string. File path to each data chunk. Support absolute path
- or path relative to the location of ``metadata.json``.
+ * ``data``: List of string. File path to each data chunk. Support absolute path.
  Tips for making chunked graph data
  ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
@@ -297,9 +296,9 @@ For example, to randomly partition MAG240M-LSC to two parts, run the
  .. code-block:: bash
  python /my/repo/dgl/tools/partition_algo/random.py
- --in-dir=/mydata/MAG240M-LSC_chunked/
- --out-dir=/mydata/MAG240M-LSC_2parts/
- --num-parts=2
+ --metadata /mydata/MAG240M-LSC_chunked/metadata.json
+ --output_path /mydata/MAG240M-LSC_2parts/
+ --num_partitions 2
  , which outputs files as follows:
@@ -345,13 +344,13 @@ efficiently. The entire step can be further accelerated using multi-processing.
  .. code-block:: bash
  python /myrepo/dgl/tools/dispatch_data.py \
- --in-dir=/mydata/MAG240M-LSC_chunked/ \
- --partition-file=/mydata/MAG240M-LSC_2parts/ \
- --out-dir=/data/MAG_LSC_partitioned \
- --ip-config=ip_config.txt
+ --in-dir /mydata/MAG240M-LSC_chunked/ \
+ --partitions-dir /mydata/MAG240M-LSC_2parts/ \
+ --out-dir data/MAG_LSC_partitioned \
+ --ip-config ip_config.txt
- * ``--in-dir`` specifies the path to the folder of the input chunked graph data produced by Step.1.
- * ``--partition-file`` specifies the path to the partition assignment file produced by Step.2.
+ * ``--in-dir`` specifies the path to the folder of the input chunked graph data produced
+ * ``--partitions-dir`` specifies the path to the partition assignment folder produced by Step.1.
  * ``--out-dir`` specifies the path to stored the data partition on each machine.
  * ``--ip-config`` specifies the IP configuration file of the cluster.