[Doc] Fix the user guide for distributed partitioning. (#2684)

* fix doc.
* explain the schema file.
* fix.

48a1794f · Da Zheng · GitHub · 80c26877 · 48a1794f
Commit 48a1794f authored Feb 22, 2021 by Da Zheng; committed by GitHub Feb 22, 2021

Showing with 97 additions and 13 deletions

docs/source/guide/distributed-preprocessing.rst +97 -13

--- a/docs/source/guide/distributed-preprocessing.rst
+++ b/docs/source/guide/distributed-preprocessing.rst
@@ -90,14 +90,24 @@ a graph in a cluster of machines. This solution requires users to prepare data f
 ParMETIS Installation
 ^^^^^^^^^^^^^^^^^^^^^

+ParMETIS requires METIS and GKLib. Please follow the instructions `here <https://github.com/KarypisLab/GKlib>`__
+to compile and install GKLib. To compile and install METIS, follow the instructions below to
+clone METIS with Git and compile it with int64 support.
+
+.. code-block:: none
+
+    git clone https://github.com/KarypisLab/METIS.git && cd METIS
+    make config shared=1 cc=gcc prefix=~/local i64=1
+    make install
+
+
 For now, we need to compile and install ParMETIS manually. We clone the DGL branch of ParMETIS as follows:

 .. code-block:: none

    git clone --branch dgl https://github.com/KarypisLab/ParMETIS.git

-Then we follow the instructions in its Github to install its dependencies including METIS
-and then compile and install ParMETIS.
+Then compile and install ParMETIS.

 .. code-block:: none

@@ -172,6 +182,8 @@ All fields are separated by whitespace:
 * `<attributes>` are optional fields. They can be used to store any values and ParMETIS does not
  interpret these fields.

+**Note**: make sure that the edge file contains no duplicate edges and no self-loop edges.
+
 `xxx_stats.txt` stores some basic statistics of the graph. It has only one line with three fields
 separated by whitespace:

@@ -225,17 +237,8 @@ an edge with the following fields:
 * `<attributes>` are optional fields that contain any edge attributes in the input edge file.

 When invoking `pm_dglpart`, the three input files: `xxx_nodes.txt`, `xxx_edges.txt`, `xxx_stats.txt`
-should be located in the directory where `pm_dglpart` runs.  The following command partitions the graph
-named `xxx` into two partitions.
-
-.. code-block:: none
-
-    pm_dglpart xxx 2
-
-The following command partitions the graph named `xxx` into  eight partitions. In this case,
-the three input files: `xxx_nodes.txt`, `xxx_edges.txt`, `xxx_stats.txt` should still be located
-in the directory where `pm_dglpart` runs. **Note**: the command actually splits the input graph
-into eight partitions.
+should be located in the directory where `pm_dglpart` runs. The following command runs four ParMETIS
+processes to partition the graph named `xxx` into eight partitions (each process handles two partitions).

 .. code-block:: none

@@ -255,6 +258,8 @@ for loading data in csv files.
 * `--input-dir INPUT_DIR` specifies the directory that contains the partition files generated by ParMETIS.
 * `--graph-name GRAPH_NAME` specifies the graph name.
 * `--schema SCHEMA` provides a file that specifies the schema of the input heterogeneous graph.
+  The schema file is a JSON file that lists node types and edge types as well as homogeneous ID ranges
+  for each node type and edge type.
 * `--num-parts NUM_PARTS` specifies the number of partitions.
 * `--num-node-weights NUM_NODE_WEIGHTS` specifies the number of node weights used by ParMETIS
  to balance partitions.
@@ -286,6 +291,85 @@ assumes all nodes/edges of any types have exactly these attributes. Therefore, i
 nodes or edges of different types contain different numbers of attributes, users need to construct
 them manually.

+The following example shows the schema of the OGBN-MAG graph for `convert_partition.py`. It has two fields:
+"nid" and "eid". Inside "nid", it lists all node types and the homogeneous ID ranges for each node type;
+inside "eid", it lists all edge types and the homogeneous ID ranges for each edge type.
+
+.. code-block:: none
+
+    {
+    "nid": {
+        "author": [
+            0,
+            1134649
+        ],
+        "field_of_study": [
+            1134649,
+            1194614
+        ],
+        "institution": [
+            1194614,
+            1203354
+        ],
+        "paper": [
+            1203354,
+            1939743
+        ]
+    },
+    "eid": {
+        "affiliated_with": [
+            0,
+            1043998
+        ],
+        "writes": [
+            1043998,
+            8189658
+        ],
+        "rev-has_topic": [
+            8189658,
+            15694736
+        ],
+        "rev-affiliated_with": [
+            15694736,
+            16738734
+        ],
+        "cites": [
+            16738734,
+            22155005
+        ],
+        "has_topic": [
+            22155005,
+            29660083
+        ],
+        "rev-cites": [
+            29660083,
+            35076354
+        ],
+        "rev-writes": [
+            35076354,
+            42222014
+        ]
+    }
+    }
+
+The following demo code constructs the schema file.
+
+.. code-block:: none
+
+    nid_ranges = {}
+    eid_ranges = {}
+    for ntype in hg.ntypes:
+        ntype_id = hg.get_ntype_id(ntype)
+        nid = th.nonzero(g.ndata[dgl.NTYPE] == ntype_id, as_tuple=True)[0]
+        nid_ranges[ntype] = [int(nid[0]), int(nid[-1] + 1)]
+
+    for etype in hg.etypes:
+        etype_id = hg.get_etype_id(etype)
+        eid = th.nonzero(g.edata[dgl.ETYPE] == etype_id, as_tuple=True)[0]
+        eid_ranges[etype] = [int(eid[0]), int(eid[-1] + 1)]
+    with open('mag.json', 'w') as outfile:
+        json.dump({'nid': nid_ranges, 'eid': eid_ranges}, outfile, indent=4)
+
 Construct node/edge features for a heterogeneous graph
 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
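
The schema example in this change suggests that each ID range is half-open (``[start, end)``, consistent with ``int(nid[-1] + 1)`` in the demo code) and that the ranges of one field tile the homogeneous ID space starting at 0. Under those assumptions, a schema file could be sanity-checked before it is passed to `convert_partition.py`. The sketch below is only an illustration: `check_schema` is a hypothetical helper, not part of DGL, and the example dictionary keeps only two of the eight edge types for brevity.

```python
import json


def check_schema(schema):
    """Return a list of problems found in a {'nid': ..., 'eid': ...} schema dict.

    Assumes half-open [start, end) ranges that are contiguous and start at 0.
    """
    problems = []
    for field in ("nid", "eid"):
        if field not in schema:
            problems.append(f"missing field {field!r}")
            continue
        # Sort the types by range start so neighbouring ranges can be compared.
        ranges = sorted(schema[field].items(), key=lambda kv: kv[1][0])
        expected_start = 0
        for name, (start, end) in ranges:
            if start >= end:
                problems.append(f"{field}/{name}: empty or reversed range [{start}, {end})")
            if start != expected_start:
                problems.append(f"{field}/{name}: starts at {start}, expected {expected_start}")
            expected_start = end
    return problems


# Node ranges copied from the OGBN-MAG example; edge ranges truncated for brevity.
example = {
    "nid": {"author": [0, 1134649], "field_of_study": [1134649, 1194614],
            "institution": [1194614, 1203354], "paper": [1203354, 1939743]},
    "eid": {"affiliated_with": [0, 1043998], "writes": [1043998, 8189658]},
}
print(check_schema(example))  # prints []
```

To check a file on disk, load it with ``json.load`` and pass the resulting dictionary to `check_schema`; an empty list means no gap, overlap, or empty range was detected under the stated assumptions.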