Unverified Commit bc2fef9c authored by Mufei Li, committed by GitHub

[Doc] Misc Fix for User Guide 7.1 Data Preprocessing (#4433)



* Update

* rollback for partition_algo/random.py
Co-authored-by: Ubuntu <ubuntu@ip-172-31-20-21.us-west-2.compute.internal>
parent d248e768
@@ -20,7 +20,7 @@ training. For example,
  import dgl
- g = ... # create or load an DGLGraph object
+ g = ... # create or load a DGLGraph object
  dgl.distributed.partition_graph(g, 'mygraph', 2, 'data_root_dir')
  will output the following data files.
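The file listing referenced by the last line above falls outside this hunk. As a concrete, hedged illustration of the call shown here, the sketch below builds a small random graph and partitions it into two parts; the toy graph, the feature name, and the ``tmp/partitioned`` output directory are placeholders for illustration, not anything prescribed by the guide.

.. code-block:: python

   import dgl
   import torch

   # Toy homogeneous graph standing in for "create or load a DGLGraph object".
   src = torch.randint(0, 1000, (5000,))
   dst = torch.randint(0, 1000, (5000,))
   g = dgl.graph((src, dst), num_nodes=1000)
   g.ndata['feat'] = torch.randn(1000, 16)

   # Split into 2 parts (METIS by default) and write the partition JSON plus
   # one part directory per partition under the output path.
   dgl.distributed.partition_graph(
       g, graph_name='mygraph', num_parts=2, out_path='tmp/partitioned')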
@@ -243,7 +243,7 @@ strict requirement as long as ``metadata.json`` contains valid file paths.
    in each chunk.
  * ``edge_type``: List of string. Edge type names in the form of
    ``<source node type>:<relation>:<destination node type>``.
- * ``num_edges_per_chunk``: List of list of integer. For graphs with :math:`R` edge
+ * ``num_edges_per_chunk``: List of list of integer. For graphs with :math:`R` edge
    types stored in :math:`P` chunks, the value contains :math:`R` integer lists.
    Each list contains :math:`P` integers, which specify the number of edges
    in each chunk.
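To make these fields concrete, here is a hedged sketch of just that fragment of ``metadata.json`` for a toy graph with two node types, :math:`R = 2` edge types, and :math:`P = 2` chunks; the type names and counts are invented, and a real ``metadata.json`` contains further keys not shown in this hunk.

.. code-block:: python

   import json

   metadata_fragment = {
       # One list per node type; each list holds P integers (nodes per chunk).
       "num_nodes_per_chunk": [
           [500, 500],    # hypothetical "paper" nodes: 1000 split into 2 chunks
           [200, 200],    # hypothetical "author" nodes: 400 split into 2 chunks
       ],
       # R edge type names in <source node type>:<relation>:<destination node type> form.
       "edge_type": [
           "paper:cites:paper",
           "author:writes:paper",
       ],
       # R lists, each with P integers: edges of that type stored in each chunk.
       "num_edges_per_chunk": [
           [2500, 2500],
           [800, 800],
       ],
   }
   print(json.dumps(metadata_fragment, indent=2))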
......@@ -262,8 +262,7 @@ strict requirement as long as ``metadata.json`` contains valid file paths.
details about how to parse each data file.
- ``"csv"``: CSV file. Use the ``delimiter`` key to specify delimiter in use.
- ``"numpy"``: NumPy array binary file created by :func:`numpy.save`.
* ``data``: List of string. File path to each data chunk. Support absolute path
or path relative to the location of ``metadata.json``.
* ``data``: List of string. File path to each data chunk. Support absolute path.
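As a hedged sketch of the ``"numpy"`` option and the ``data`` list described above, the snippet below saves two feature chunks with :func:`numpy.save` and records their paths. The feature name, shapes, and the nesting of the ``format`` entry are illustrative assumptions; per the updated line, the paths recorded in the real ``metadata.json`` should be absolute.

.. code-block:: python

   import json
   import os

   import numpy as np

   # Two hypothetical feature chunks for one node type; sizes must line up
   # with num_nodes_per_chunk for that type.
   chunk_sizes = [500, 500]
   paths = []
   for i, n in enumerate(chunk_sizes):
       path = os.path.abspath(f"paper_feat_chunk_{i}.npy")
       np.save(path, np.random.rand(n, 16).astype("float32"))
       paths.append(path)

   # Illustrative entry pointing at the chunks; absolute paths, per the text above.
   feat_entry = {
       "format": {"name": "numpy"},
       "data": paths,
   }
   print(json.dumps(feat_entry, indent=2))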
Tips for making chunked graph data
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
@@ -297,9 +296,9 @@ For example, to randomly partition MAG240M-LSC to two parts, run the
  .. code-block:: bash

     python /my/repo/dgl/tools/partition_algo/random.py
-        --in-dir=/mydata/MAG240M-LSC_chunked/
-        --out-dir=/mydata/MAG240M-LSC_2parts/
-        --num-parts=2
+        --metadata /mydata/MAG240M-LSC_chunked/metadata.json
+        --output_path /mydata/MAG240M-LSC_2parts/
+        --num_partitions 2
  , which outputs files as follows:
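The output file listing that the sentence above introduces sits outside this hunk. Conceptually, random partitioning just draws one partition ID per node; the sketch below mimics that with made-up node counts, and the per-node-type ``<node type>.txt`` one-ID-per-line layout is an assumption for illustration rather than the tool's documented output format.

.. code-block:: python

   import numpy as np

   num_parts = 2
   num_nodes = {"paper": 1000, "author": 400}   # invented toy counts

   rng = np.random.default_rng(0)
   for ntype, count in num_nodes.items():
       # One random partition ID in [0, num_parts) per node of this type.
       assignment = rng.integers(0, num_parts, size=count)
       np.savetxt(f"{ntype}.txt", assignment, fmt="%d")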
@@ -345,13 +344,13 @@ efficiently. The entire step can be further accelerated using multi-processing.
  .. code-block:: bash

     python /myrepo/dgl/tools/dispatch_data.py \
-        --in-dir=/mydata/MAG240M-LSC_chunked/ \
-        --partition-file=/mydata/MAG240M-LSC_2parts/ \
-        --out-dir=/data/MAG_LSC_partitioned \
-        --ip-config=ip_config.txt
+        --in-dir /mydata/MAG240M-LSC_chunked/ \
+        --partitions-dir /mydata/MAG240M-LSC_2parts/ \
+        --out-dir data/MAG_LSC_partitioned \
+        --ip-config ip_config.txt

- * ``--in-dir`` specifies the path to the folder of the input chunked graph data produced by Step.1.
- * ``--partition-file`` specifies the path to the partition assignment file produced by Step.2.
+ * ``--in-dir`` specifies the path to the folder of the input chunked graph data produced by Step.1.
+ * ``--partitions-dir`` specifies the path to the partition assignment folder produced by Step.1.
  * ``--out-dir`` specifies the path to store the data partition on each machine.
  * ``--ip-config`` specifies the IP configuration file of the cluster.
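To round out the ``--ip-config`` flag above, here is a hedged sketch that writes an ``ip_config.txt`` with one machine per line; the addresses are placeholders, and whether a port column is also required should be checked against the tool's documentation for your DGL version.

.. code-block:: python

   # Hypothetical cluster of two machines; replace with real addresses.
   machines = ["172.31.20.1", "172.31.20.2"]

   with open("ip_config.txt", "w") as f:
       for ip in machines:
           f.write(f"{ip}\n")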