Commit bb89dee7, authored by Kay Liu, committed by GitHub

[BugFix] fix problems in data split (#3082)



* [BugFix] fix problems in data split

* fix format problems in docstring

* modify statistics to fit in dgl nature
Co-authored-by: Quan (Andy) Gan <coin2028@hotmail.com>
Co-authored-by: zhjwy9343 <6593865@qq.com>
parent 0d1dcdcd
@@ -19,15 +19,19 @@ class FakeNewsDataset(DGLBuiltinDataset):
     the root node represents the news, the leaf nodes are Twitter users
     who retweeted the root news. Besides, the node features are encoded
     user historical tweets using different pretrained language models:
-    bert: the 768-dimensional node feature composed of Twitter user
-        historical tweets encoded by the bert-as-service
-    content: the 310-dimensional node feature composed of a
-        300-dimensional “spacy” vector plus a 10-dimensional
-        “profile” vector
-    profile: the 10-dimensional node feature composed of ten Twitter
-        user profile attributes.
-    spacy: the 300-dimensional node feature composed of Twitter user
-        historical tweets encoded by the spaCy word2vec encoder.
+
+    - bert: the 768-dimensional node feature composed of Twitter user
+      historical tweets encoded by the bert-as-service
+
+    - content: the 310-dimensional node feature composed of a
+      300-dimensional “spacy” vector plus a 10-dimensional
+      “profile” vector
+
+    - profile: the 10-dimensional node feature composed of ten Twitter
+      user profile attributes.
+
+    - spacy: the 300-dimensional node feature composed of Twitter user
+      historical tweets encoded by the spaCy word2vec encoder.
     Note: this dataset is for academic use only, and commercial use is prohibited.
@@ -39,27 +43,33 @@ class FakeNewsDataset(DGLBuiltinDataset):
     - Nodes: 41,054
     - Edges: 40,740
     - Classes:
-        Fake: 157
-        Real: 157
+
+      - Fake: 157
+      - Real: 157
     - Node feature size:
-        bert: 768
-        content: 310
-        profile: 10
-        spacy: 300
+
+      - bert: 768
+      - content: 310
+      - profile: 10
+      - spacy: 300
     Gossipcop:
-    - Graphs: 5464
+    - Graphs: 5,464
     - Nodes: 314,262
     - Edges: 308,798
     - Classes:
-        Fake: 2732
-        Real: 2732
+
+      - Fake: 2,732
+      - Real: 2,732
     - Node feature size:
-        bert: 768
-        content: 310
-        profile: 10
-        spacy: 300
+
+      - bert: 768
+      - content: 310
+      - profile: 10
+      - spacy: 300
     Parameters
     ----------
......
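The FakeNewsDataset statistics above can be cross-checked with a little arithmetic. The sketch below copies the counts from the docstring and tests two things: that edges equal nodes minus graphs, which is what one would expect if every graph is a single propagation tree (an assumption suggested by the root/leaf description, not stated in the commit), and that the `content` feature size is the sum of `spacy` and `profile`. The Politifact graph count is not shown in this excerpt, so the sum of its two class counts stands in for it; the names `STATS`, `FEATURE_SIZE`, and `implied_graphs` are illustrative only.

```python
# Counts copied from the FakeNewsDataset docstring in this diff.
STATS = {
    "politifact": {"nodes": 41_054, "edges": 40_740, "fake": 157, "real": 157},
    "gossipcop": {"nodes": 314_262, "edges": 308_798, "fake": 2_732, "real": 2_732},
}
FEATURE_SIZE = {"bert": 768, "content": 310, "profile": 10, "spacy": 300}


def implied_graphs(stats):
    """Number of trees implied by the counts: a forest of T trees
    with V nodes has V - T edges, so T = V - E."""
    return stats["nodes"] - stats["edges"]


for name, stats in STATS.items():
    # Assumption: one graph per labeled news item, so graphs = fake + real.
    graphs = stats["fake"] + stats["real"]
    consistent = implied_graphs(stats) == graphs
    avg_nodes = round(stats["nodes"] / graphs, 1)
    print(f"{name}: graphs={graphs}, tree-consistent={consistent}, "
          f"avg nodes per graph={avg_nodes}")

# "content" is described as a spaCy vector plus a profile vector.
content_matches = (
    FEATURE_SIZE["content"] == FEATURE_SIZE["spacy"] + FEATURE_SIZE["profile"]
)
print("content = spacy + profile:", content_matches)
```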
@@ -177,15 +177,15 @@ class FraudDataset(DGLBuiltinDataset):
             "must between 0 and 1 (inclusive)."
         N = x.shape[0]
-        index = list(range(N))
+        index = np.arange(N)
         if self.name == 'amazon':
             # 0-3304 are unlabeled nodes
-            index = list(range(3305, N))
+            index = np.arange(3305, N)
-        np.random.RandomState(seed).permutation(index)
+        index = np.random.RandomState(seed).permutation(index)
-        train_idx = index[:int(train_size * N)]
+        train_idx = index[:int(train_size * len(index))]
-        val_idx = index[int(N - val_size * N):]
+        val_idx = index[len(index) - int(val_size * len(index)):]
-        test_idx = index[int(train_size * N):int(N - val_size * N)]
+        test_idx = index[int(train_size * len(index)):len(index) - int(val_size * len(index))]
         train_mask = np.zeros(N, dtype=np.bool)
         val_mask = np.zeros(N, dtype=np.bool)
         test_mask = np.zeros(N, dtype=np.bool)
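The split fix in this hunk comes down to two points: `permutation` returns a shuffled copy, so its result has to be assigned, and the split sizes should be computed from the number of labeled indices rather than from the total node count `N`. A minimal standalone sketch of the corrected logic follows. The function name `split_masks`, the `first_labeled` parameter, and the default sizes and seed are illustrative assumptions, not part of the commit, and `dtype=bool` is used because the `np.bool` alias is no longer available in recent NumPy releases.

```python
import numpy as np


def split_masks(num_nodes, train_size=0.7, val_size=0.1, seed=0, first_labeled=0):
    """Return boolean (train, val, test) masks over `num_nodes` nodes.

    Nodes with id < first_labeled are treated as unlabeled and excluded
    from every mask (hypothetical parameter; the diff hard-codes 3305
    for the 'amazon' dataset).
    """
    index = np.arange(first_labeled, num_nodes)
    # permutation() returns a new shuffled array; it does not shuffle in place.
    index = np.random.RandomState(seed).permutation(index)

    n = len(index)  # size of the labeled pool, not the total node count
    n_train = int(train_size * n)
    n_val = int(val_size * n)

    train_idx = index[:n_train]
    test_idx = index[n_train:n - n_val]
    val_idx = index[n - n_val:]

    masks = []
    for idx in (train_idx, val_idx, test_idx):
        mask = np.zeros(num_nodes, dtype=bool)
        mask[idx] = True
        masks.append(mask)
    return tuple(masks)


# Amazon-like setting: 11,944 nodes, ids 0-3304 unlabeled.
train_mask, val_mask, test_mask = split_masks(11944, first_labeled=3305)
```

With the pre-fix arithmetic, `int(train_size * N)` would have been taken over all 11,944 nodes even though only 8,639 indices were in the pool, so the train slice would have absorbed more than its intended share and the test slice would have shrunk accordingly.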
@@ -202,9 +202,9 @@ class FraudYelpDataset(FraudDataset):
     The Yelp dataset includes hotel and restaurant reviews filtered (spam) and recommended
     (legitimate) by Yelp. A spam review detection task can be conducted, which is a binary
-    classification task. 32 handcrafted features from
-    <http://dx.doi.org/10.1145/2783258.2783370> are taken as the raw node features. Reviews
-    are nodes in the graph, and three relations are:
+    classification task. 32 handcrafted features from <http://dx.doi.org/10.1145/2783258.2783370>
+    are taken as the raw node features. Reviews are nodes in the graph, and three relations are:
+
     1. R-U-R: it connects reviews posted by the same user
     2. R-S-R: it connects reviews under the same product with the same star rating (1-5 stars)
     3. R-T-R: it connects two reviews under the same product posted in the same month.
@@ -213,13 +213,16 @@ class FraudYelpDataset(FraudDataset):
     - Nodes: 45,954
     - Edges:
-        R-U-R: 49,315
-        R-T-R: 573,616
-        R-S-R: 3,402,743
-        ALL: 3,846,979
+
+      - R-U-R: 98,630
+      - R-T-R: 1,147,232
+      - R-S-R: 6,805,486
     - Classes:
-        Positive (spam): 6,677
-        Negative (legitimate): 39,277
+
+      - Positive (spam): 6,677
+      - Negative (legitimate): 39,277
     - Positive-Negative ratio: 1 : 5.9
     - Node feature size: 32
@@ -269,23 +272,27 @@ class FraudAmazonDataset(FraudDataset):
     the raw node features .
     Users are nodes in the graph, and three relations are:
     1. U-P-U : it connects users reviewing at least one same product
     2. U-S-U : it connects users having at least one same star rating within one week
     3. U-V-U : it connects users with top 5% mutual review text similarities (measured by
        TF-IDF) among all users.
     Statistics:
     - Nodes: 11,944
     - Edges:
-        U-P-U: 175,608
-        U-S-U: 3,566,479
-        U-V-U: 1,036,737
-        ALL: 4,398,392
+
+      - U-P-U: 351,216
+      - U-S-U: 7,132,958
+      - U-V-U: 2,073,474
     - Classes:
-        Positive (fraudulent): 821
-        Negative (benign): 11,123
-    - Positive-Negative ratio: 1 : 13.5
+
+      - Positive (fraudulent): 821
+      - Negative (benign): 7,818
+      - Unlabeled: 3,305
+    - Positive-Negative ratio: 1 : 10.5
     - Node feature size: 25
     Parameters
......
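The statistics changes in the two fraud dataset docstrings can be checked with plain arithmetic. Every new per-relation edge count is exactly twice the old one, which would be consistent with counting each undirected edge as two directed edges; that reading of "fit in dgl nature" is an assumption, since the commit message does not spell it out. The sketch below copies the numbers from the diff and also recomputes the label totals and class ratios. The helper names are illustrative only.

```python
# (old, new) edge counts per relation, copied from the docstring diff.
EDGE_COUNTS = {
    "yelp": {
        "R-U-R": (49_315, 98_630),
        "R-T-R": (573_616, 1_147_232),
        "R-S-R": (3_402_743, 6_805_486),
    },
    "amazon": {
        "U-P-U": (175_608, 351_216),
        "U-S-U": (3_566_479, 7_132_958),
        "U-V-U": (1_036_737, 2_073_474),
    },
}


def all_doubled(counts):
    """True if every new count is exactly twice the old count."""
    return all(new == 2 * old for old, new in counts.values())


def negatives_per_positive(positive, negative):
    """Negatives per positive, rounded to one decimal place."""
    return round(negative / positive, 1)


yelp_doubled = all_doubled(EDGE_COUNTS["yelp"])
amazon_doubled = all_doubled(EDGE_COUNTS["amazon"])

# Label counts should add up to the node counts in the docstrings.
yelp_total = 6_677 + 39_277            # positive + negative
amazon_total = 821 + 7_818 + 3_305     # positive + negative + unlabeled

yelp_ratio = negatives_per_positive(6_677, 39_277)
amazon_ratio = negatives_per_positive(821, 7_818)

print(yelp_doubled, amazon_doubled, yelp_total, amazon_total)
print(yelp_ratio, amazon_ratio)
```

The label totals reproduce the 45,954 and 11,944 node counts, and the Yelp ratio reproduces the documented 1 : 5.9. Dividing the listed Amazon counts gives roughly 9.5 negatives per positive, so the 1 : 10.5 figure in the docstring may have been derived from a different count.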