README.md 7.09 KB
Newer Older
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
# DGL Utility Scripts

This folder contains the utilities that do not belong to DGL core package as standalone executable
scripts.

## Graph Chunking

`chunk_graph.py` provides an example of chunking an existing DGLGraph object into the on-disk
[chunked graph format](http://13.231.216.217/guide/distributed-preprocessing.html#chunked-graph-format).

<!-- TODO: change the link of documentation once it's merged to master -->

An example of chunking the OGB MAG240M dataset:

```python
import ogb.lsc

dataset = ogb.lsc.MAG240MDataset('.')
etypes = [
    ('paper', 'cites', 'paper'),
    ('author', 'writes', 'paper'),
    ('author', 'affiliated_with', 'institution')]
g = dgl.heterograph({k: tuple(dataset.edge_index(*k)) for k in etypes})
chunk_graph(
    g,
    'mag240m',
    {'paper': {
        'feat': 'mag240m_kddcup2021/processed/paper/node_feat.npy',
        'label': 'mag240m_kddcup2021/processed/paper/node_label.npy',
        'year': 'mag240m_kddcup2021/processed/paper/node_year.npy'}},
    {},
    4,
    'output')
```

The output chunked graph metadata will go as follows (assuming the current directory as
`/home/user`:

```json
{
    "graph_name": "mag240m",
    "node_type": [
        "author",
        "institution",
        "paper"
    ],
    "num_nodes_per_chunk": [
        [
            30595778,
            30595778,
            30595778,
            30595778
        ],
        [
            6431,
            6430,
            6430,
            6430
        ],
        [
            30437917,
            30437917,
            30437916,
            30437916
        ]
    ],
    "edge_type": [
        "author:affiliated_with:institution",
        "author:writes:paper",
        "paper:cites:paper"
    ],
    "num_edges_per_chunk": [
        [
            11148147,
            11148147,
            11148146,
            11148146
        ],
        [
            96505680,
            96505680,
            96505680,
            96505680
        ],
        [
            324437232,
            324437232,
            324437231,
            324437231
        ]
    ],
    "edges": {
        "author:affiliated_with:institution": {
            "format": {
                "name": "csv",
                "delimiter": " "
            },
            "data": [
                "/home/user/output/edge_index/author:affiliated_with:institution0.txt",
                "/home/user/output/edge_index/author:affiliated_with:institution1.txt",
                "/home/user/output/edge_index/author:affiliated_with:institution2.txt",
                "/home/user/output/edge_index/author:affiliated_with:institution3.txt"
            ]
        },
        "author:writes:paper": {
            "format": {
                "name": "csv",
                "delimiter": " "
            },
            "data": [
                "/home/user/output/edge_index/author:writes:paper0.txt",
                "/home/user/output/edge_index/author:writes:paper1.txt",
                "/home/user/output/edge_index/author:writes:paper2.txt",
                "/home/user/output/edge_index/author:writes:paper3.txt"
            ]
        },
        "paper:cites:paper": {
            "format": {
                "name": "csv",
                "delimiter": " "
            },
            "data": [
                "/home/user/output/edge_index/paper:cites:paper0.txt",
                "/home/user/output/edge_index/paper:cites:paper1.txt",
                "/home/user/output/edge_index/paper:cites:paper2.txt",
                "/home/user/output/edge_index/paper:cites:paper3.txt"
            ]
        }
    },
    "node_data": {
        "paper": {
            "feat": {
                "format": {
                    "name": "numpy"
                },
                "data": [
                    "/home/user/output/node_data/paper/feat-0.npy",
                    "/home/user/output/node_data/paper/feat-1.npy",
                    "/home/user/output/node_data/paper/feat-2.npy",
                    "/home/user/output/node_data/paper/feat-3.npy"
                ]
            },
            "label": {
                "format": {
                    "name": "numpy"
                },
                "data": [
                    "/home/user/output/node_data/paper/label-0.npy",
                    "/home/user/output/node_data/paper/label-1.npy",
                    "/home/user/output/node_data/paper/label-2.npy",
                    "/home/user/output/node_data/paper/label-3.npy"
                ]
            },
            "year": {
                "format": {
                    "name": "numpy"
                },
                "data": [
                    "/home/user/output/node_data/paper/year-0.npy",
                    "/home/user/output/node_data/paper/year-1.npy",
                    "/home/user/output/node_data/paper/year-2.npy",
                    "/home/user/output/node_data/paper/year-3.npy"
                ]
            }
        }
    },
    "edge_data": {}
}
```
170
171
172
173
174
175
176
177

## Change edge type to canonical edge type for partition configuration json

In the upcoming DGL v1.0, we will require the partition configuration file to contain only canonical edge type. This tool is designed to help migrating existing configuration files from old style to new one.

### Sample Usage

```
Rhett Ying's avatar
Rhett Ying committed
178
python tools/change_etype_to_canonical_etype.py --part_config "{configuration file path}"
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
```

### Requirement

Partition algorithms produce one configuration file and multiple data folders, and each data folder corresponds to a partition. **This tool needs to read from the partition configuration file (specified by the commandline argument) *and* the graph structure data (stored in `graph.dgl` under the data folder) of the first partition.** They can be local files or shared files among network, if you follow this [official tutorial](https://docs.dgl.ai/en/latest/tutorials/dist/1_node_classification.html#sphx-glr-tutorials-dist-1-node-classification-py) for distributed training, you don't need to care about this as all files are shared by every participant through NFS.

**For example, below is a typical data folder expected by this tool:**
```
data_root_dir/
|-- graph_name.json    # specified by part_config
|-- part0/
    ...
    |-- graph.dgl
...
```

For more information about partition algorithm, see https://docs.dgl.ai/en/latest/generated/dgl.distributed.partition.partition_graph.html.

### Input arguments

Rhett Ying's avatar
Rhett Ying committed
199
1. *part_config*: The path of partition json file. < **Required**>
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236

### Result

This tool changes the key of ``etypes`` and ``edge_map`` from format ``str`` to ``str:str:str`` and it overwrites the original file instead of creating a new one.

E.g. **File content before running the script**
```json
{
    "edge_map": {
        "r1": [ [ 0, 6 ], [ 16, 20 ] ],
        "r2": [ [ 6, 11 ], [ 20, 25 ] ],
        "r3": [ [ 11, 16 ], [ 25, 30 ] ]
    },
    "etypes": {
        "r1": 0,
        "r2": 1,
        "r3": 2
    },
    ...
}
```

**After running**
```json
{
    "edge_map": {
        "n1:r1:n2": [ [ 0, 6 ], [ 16, 20 ] ],
        "n1:r2:n3": [ [ 6, 11 ], [ 20, 25 ] ],
        "n2:r3:n3": [ [ 11, 16 ], [ 25, 30 ] ] },
    "etypes": {
        "n1:r1:n2": 0,
        "n1:r2:n3": 1,
        "n2:r3:n3": 2
    }
    ...
}
```