# DGL Implementation of the CARE-GNN Paper

This DGL example implements the CAmouflage-REsistant GNN (CARE-GNN) model proposed in the paper [Enhancing Graph Neural Network-based Fraud Detectors against Camouflaged Fraudsters](https://arxiv.org/abs/2008.08692). The authors' original implementation is available [here](https://github.com/YingtongDou/CARE-GNN).

**NOTE**: The sampling version of this model has been modified to fit DGL's NodeDataLoader. For Equation (2) in the paper, instead of using the embedding from the last layer, this version uses the embedding of the current layer from the previous epoch to measure the similarity between center nodes and their neighbors.
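
For reference, below is a minimal PyTorch sketch of the kind of label-aware similarity measure the note above refers to: the similarity between a center node and a neighbor is one minus the L1 distance between their scores from a shared one-layer MLP. This is a sketch under stated assumptions (tanh activation, scalar MLP output); the class and parameter names (`LabelAwareSimilarity`, `feat_dim`) are illustrative, not the names used in this example's code.

```python
import torch
import torch.nn as nn

class LabelAwareSimilarity(nn.Module):
    """Sketch of the paper's similarity measure: one minus the L1 distance
    between the tanh-activated scores of a shared one-layer MLP."""

    def __init__(self, feat_dim):
        super().__init__()
        self.mlp = nn.Linear(feat_dim, 1)  # one-layer MLP scoring each node

    def forward(self, center_emb, neighbor_emb):
        # center_emb, neighbor_emb: (N, feat_dim). In the sampling version,
        # these are the current layer's embeddings from the previous epoch.
        center_score = torch.tanh(self.mlp(center_emb))
        neighbor_score = torch.tanh(self.mlp(neighbor_emb))
        distance = torch.abs(center_score - neighbor_score).squeeze(-1)  # L1
        return 1.0 - distance  # higher value = more similar
```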

Example implementor
----------------------
This example was implemented by [Kay Liu](https://github.com/kayzliu) during his SDE internship at the AWS Shanghai AI Lab.

Dependencies
----------------------
- Python 3.7.10
- PyTorch 1.8.1
- dgl 0.7.1
- scikit-learn 0.23.2

Dataset
---------------------------------------
The datasets used for node classification are DGL's built-in FraudDataset. Their statistics are summarized below; a minimal loading sketch follows the statistics.

**Amazon**

- Nodes: 11,944
- Edges:
    - U-P-U: 351,216
    - U-S-U: 7,132,958
    - U-V-U: 2,073,474
- Classes:
    - Positive (fraudulent): 821
    - Negative (benign): 7,818
    - Unlabeled: 3,305
- Positive-Negative ratio: 1 : 10.5
- Node feature size: 25

**YelpChi**

- Nodes: 45,954
- Edges:
    - R-U-R: 98,630
    - R-T-R: 1,147,232
    - R-S-R: 6,805,486
- Classes:
    - Positive (spam): 6,677
    - Negative (legitimate): 39,277
- Positive-Negative ratio: 1 : 5.9
- Node feature size: 32
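
As a quick orientation, here is a minimal sketch of loading these datasets through DGL's built-in `FraudDataset`. Field names follow DGL's FraudDataset as of dgl 0.7; the printed shapes correspond to the Amazon statistics above.

```python
from dgl.data import FraudDataset

# 'amazon' or 'yelp' selects the corresponding built-in fraud dataset
dataset = FraudDataset('amazon')
graph = dataset[0]  # heterogeneous graph: one node type, three relations

print(graph.canonical_etypes)        # e.g. [('user', 'net_upu', 'user'), ...]
print(graph.ndata['feature'].shape)  # torch.Size([11944, 25]) for Amazon
print(graph.ndata['label'].shape)    # binary fraud labels; train/val/test
                                     # splits are in graph.ndata['train_mask'] etc.
```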

How to run
--------------------------------
To train the full graph version with early stopping, run the following in the care-gnn folder
```
python main.py --early-stop
```

To use a GPU, run
```
python main.py --gpu 0
```

To train on the YelpChi dataset instead of Amazon, run
```
python main.py --dataset yelp
```

To run the sampling version, run
```
python main_sampling.py
```

Performance
-------------------------
The paper reports its best validation results within 30 epochs. The table below lists both the validation and test results obtained under the same settings as the paper, except for the random seed (here `seed=717`).

<table>
<thead>
  <tr>
    <th colspan="2">Dataset</th>
    <th>Amazon</th>
    <th>Yelp</th>
  </tr>
</thead>
<tbody>
  <tr>
<td>Metric</td>
    <td>Max Epoch</td>
    <td>30</td>
    <td>30</td>
  </tr>
  <tr>
    <td rowspan="3">AUC (val/test)</td>
    <td>paper reported</td>
    <td>0.8973 / -</td>
    <td>0.7570 / -</td>
  </tr>
  <tr>
    <td>DGL full graph</td>
    <td>0.8849 / 0.8922</td>
    <td>0.6856 / 0.6867</td>
  </tr>
  <tr>
    <td>DGL sampling</td>
    <td>0.9350 / 0.9331</td>
    <td>0.7857 / 0.7890</td>
  </tr>
  <tr>
    <td rowspan="3">Recall (val/test)</td>
    <td>paper reported</td>
    <td>0.8848 / -</td>
    <td>0.7192 / -</td>
  </tr>
  <tr>
    <td>DGL full graph</td>
    <td>0.8615 / 0.8544</td>
<td>0.6667 / 0.6619</td>
  </tr>
  <tr>
    <td>DGL sampling</td>
    <td>0.9130 / 0.9045</td>
    <td>0.7537 / 0.7540</td>
  </tr>
</tbody>
</table>