# Node classification on heterogeneous graph with RGCN

This example demonstrates how to run a node classification task on a heterogeneous graph with **GraphBolt**. The models are not yet tuned for the best accuracy.
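
The script assembles its mini-batch pipeline with `dgl.graphbolt`. Below is a minimal sketch of what such a pipeline typically looks like; the attribute names, fan-outs, batch size, and feature keys are illustrative assumptions rather than the exact code in `hetero_rgcn.py`, and API details may vary across DGL versions.

```python
import dgl.graphbolt as gb


def make_train_dataloader(dataset, device="cpu"):
    """A minimal sketch of a GraphBolt sampling pipeline (illustrative only).

    `dataset` is assumed to follow the OnDiskDataset convention, exposing
    `graph`, `feature` and `tasks`; fan-outs, batch size and feature keys
    below are placeholder values, not the ones used by hetero_rgcn.py.
    """
    graph = dataset.graph
    features = dataset.feature
    train_set = dataset.tasks[0].train_set

    # Seed nodes -> neighbor sampling -> feature fetching -> copy to device.
    datapipe = gb.ItemSampler(train_set, batch_size=1024, shuffle=True)
    datapipe = datapipe.sample_neighbor(graph, [25, 10])
    datapipe = datapipe.fetch_feature(
        features, node_feature_keys={"paper": ["feat"]}
    )
    datapipe = datapipe.copy_to(device)
    return gb.DataLoader(datapipe)
```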

## Run on `ogbn-mag` dataset

### Sample on CPU and train/infer on CPU
```
python3 hetero_rgcn.py --dataset ogbn-mag
```

### Sample on CPU and train/infer on GPU
```
python3 hetero_rgcn.py --dataset ogbn-mag --num_gpus 1
```

### Resource usage and time cost
Below results are roughly collected on an AWS EC2 **g4dn.metal** instance: 384GB RAM, 96 vCPUs (Cascade Lake P-8259L), 8 NVIDIA T4 GPUs (16GB RAM each). CPU RAM usage is the peak value of the `used` field reported by the `free` command, which is a rough estimate; refer to `RSS`/`USS`/`PSS` for more accurate numbers. GPU RAM usage is the peak value recorded by `nvidia-smi`.
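
If you want to reproduce the numbers below, a small helper like this hypothetical `poll_peak_usage` sketch (not part of the example) can poll `free` and `nvidia-smi` from a separate shell while training runs and report the peak values it observed.

```python
import subprocess
import time


def poll_peak_usage(interval=1.0, duration=300.0):
    """Poll `free -m` and `nvidia-smi` and print the peak values observed (MiB)."""
    peak_cpu_used = peak_gpu_used = 0
    end = time.time() + duration
    while time.time() < end:
        # Second line of `free -m` is the "Mem:" row; column 3 is the `used` field.
        mem_row = subprocess.check_output(["free", "-m"], text=True).splitlines()[1]
        peak_cpu_used = max(peak_cpu_used, int(mem_row.split()[2]))
        # One "memory.used" value (in MiB) per GPU, one per line.
        gpu_out = subprocess.check_output(
            ["nvidia-smi", "--query-gpu=memory.used", "--format=csv,noheader,nounits"],
            text=True,
        )
        peak_gpu_used = max(peak_gpu_used, max(int(v) for v in gpu_out.split()))
        time.sleep(interval)
    print(f"Peak CPU used: {peak_cpu_used} MiB, peak GPU used: {peak_gpu_used} MiB")
```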

| Dataset Size | CPU RAM Usage | Num of GPUs | GPU RAM Usage | Time Per Epoch (Training) |
| ------------ | ------------- | ----------- | ------------- | ------------------------- |
| ~1.1GB       | ~5.3GB        | 0           | 0GB           | ~230s                     |
| ~1.1GB       | ~3GB          | 1           | 3.87GB        | ~64.6s                    |

### Accuracies
```
Epoch: 01, Loss: 2.3434, Valid accuracy: 48.23%
Epoch: 02, Loss: 1.5646, Valid accuracy: 48.49%
Epoch: 03, Loss: 1.1633, Valid accuracy: 45.79%
Test accuracy 44.6792
```

## Run on `ogb-lsc-mag240m` dataset

### Sample on CPU and train/infer on CPU
```
python3 hetero_rgcn.py --dataset ogb-lsc-mag240m
```

### Sample on CPU and train/infer on GPU
```
python3 hetero_rgcn.py --dataset ogb-lsc-mag240m --num_gpus 1
```

### Resource usage and time cost
Below results are roughly collected on an AWS EC2 **g4dn.metal** instance: 384GB RAM, 96 vCPUs (Cascade Lake P-8259L), 8 NVIDIA T4 GPUs (16GB RAM each). CPU RAM usage is the peak value of the `used` field reported by the `free` command, which is a rough estimate; refer to `RSS`/`USS`/`PSS` for more accurate numbers. GPU RAM usage is the peak value recorded by `nvidia-smi`.

> **Note:**
> `buffer/cache` is heavily used during training, reaching about 300GB. If more RAM is available, even more `buffer/cache` will be consumed, since the graph is about 55GB and the feature data is about 350GB.
> The first epoch is also quite slow because `buffer/cache` is not warmed up yet. For GPU training, the first epoch takes about **1030s**. Even in subsequent epochs, time consumption varies.
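
To see whether the page cache is warm before timing an epoch, a quick check like the hypothetical helper below (not part of the example) can read the `Buffers` and `Cached` fields from `/proc/meminfo`, which roughly correspond to the `buff/cache` column of `free`.

```python
def buff_cache_gb():
    """Return the current Buffers + Cached size from /proc/meminfo, in GB."""
    total_kb = 0
    with open("/proc/meminfo") as f:
        for line in f:
            key, value = line.split(":", 1)
            if key in ("Buffers", "Cached"):
                total_kb += int(value.split()[0])  # values are reported in kB
    return total_kb / 1024 ** 2


if __name__ == "__main__":
    print(f"buffer/cache: {buff_cache_gb():.1f} GB")
```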

| Dataset Size | CPU RAM Usage | Num of GPUs | GPU RAM Usage | Time Per Epoch (Training) |
| ------------ | ------------- | ----------- | ------------- | ------------------------- |
| ~404GB       | ~67GB         | 0           | 0GB           | ~248s                     |
| ~404GB       | ~60GB         | 1           | 15GB          | ~166s                     |

### Accuracies
```
Epoch: 01, Loss: 2.1432, Valid accuracy: 50.21%
Epoch: 02, Loss: 1.9267, Valid accuracy: 50.77%
Epoch: 03, Loss: 1.8797, Valid accuracy: 53.38%
```