"examples/mxnet/vscode:/vscode.git/clone" did not exist on "b89dcce16cd1acf7c14992b5df681ab76a9b26e9"
Commit bb89dee7, authored by Kay Liu, committed by GitHub

[BugFix] fix problems in data split (#3082)



* [BugFix] fix problems in data split

* fix format problems in docstring

* modify statistics to fit in dgl nature
Co-authored-by: Quan (Andy) Gan <coin2028@hotmail.com>
Co-authored-by: zhjwy9343 <6593865@qq.com>
parent 0d1dcdcd
@@ -19,14 +19,18 @@ class FakeNewsDataset(DGLBuiltinDataset):
     the root node represents the news, the leaf nodes are Twitter users
     who retweeted the root news. Besides, the node features are encoded
     user historical tweets using different pretrained language models:
-    bert: the 768-dimensional node feature composed of Twitter user
+    - bert: the 768-dimensional node feature composed of Twitter user
         historical tweets encoded by the bert-as-service
-    content: the 310-dimensional node feature composed of a
+    - content: the 310-dimensional node feature composed of a
         300-dimensional “spacy” vector plus a 10-dimensional
         “profile” vector
-    profile: the 10-dimensional node feature composed of ten Twitter
+    - profile: the 10-dimensional node feature composed of ten Twitter
         user profile attributes.
-    spacy: the 300-dimensional node feature composed of Twitter user
+    - spacy: the 300-dimensional node feature composed of Twitter user
         historical tweets encoded by the spaCy word2vec encoder.
     Note: this dataset is for academic use only, and commercial use is prohibited.
@@ -39,27 +43,33 @@ class FakeNewsDataset(DGLBuiltinDataset):
     - Nodes: 41,054
     - Edges: 40,740
     - Classes:
-        Fake: 157
-        Real: 157
+        - Fake: 157
+        - Real: 157
     - Node feature size:
-        bert: 768
-        content: 310
-        profile: 10
-        spacy: 300
+        - bert: 768
+        - content: 310
+        - profile: 10
+        - spacy: 300
     Gossipcop:
-    - Graphs: 5464
+    - Graphs: 5,464
     - Nodes: 314,262
     - Edges: 308,798
     - Classes:
-        Fake: 2732
-        Real: 2732
+        - Fake: 2,732
+        - Real: 2,732
     - Node feature size:
-        bert: 768
-        content: 310
-        profile: 10
-        spacy: 300
+        - bert: 768
+        - content: 310
+        - profile: 10
+        - spacy: 300
     Parameters
     ----------
......
@@ -177,15 +177,15 @@ class FraudDataset(DGLBuiltinDataset):
         "must between 0 and 1 (inclusive)."
     N = x.shape[0]
-    index = list(range(N))
+    index = np.arange(N)
     if self.name == 'amazon':
         # 0-3304 are unlabeled nodes
-        index = list(range(3305, N))
+        index = np.arange(3305, N)
-    np.random.RandomState(seed).permutation(index)
-    train_idx = index[:int(train_size * N)]
-    val_idx = index[int(N - val_size * N):]
-    test_idx = index[int(train_size * N):int(N - val_size * N)]
+    index = np.random.RandomState(seed).permutation(index)
+    train_idx = index[:int(train_size * len(index))]
+    val_idx = index[len(index) - int(val_size * len(index)):]
+    test_idx = index[int(train_size * len(index)):len(index) - int(val_size * len(index))]
     train_mask = np.zeros(N, dtype=np.bool)
     val_mask = np.zeros(N, dtype=np.bool)
     test_mask = np.zeros(N, dtype=np.bool)
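The hunk above appears to fix two separate problems. First, `RandomState.permutation` returns a shuffled copy rather than shuffling in place, so discarding its return value left `index` in its original order. Second, the slice boundaries were computed from `N` (all nodes) even when `index` held only the labeled nodes. The following is a minimal standalone sketch of the corrected logic, not the library's code: the function name `split_indices` and its `first_labeled` parameter are hypothetical, introduced only for illustration.

```python
import numpy as np


def split_indices(n_nodes, train_size, val_size, seed, first_labeled=0):
    """Shuffle labeled node IDs and cut them into train / test / val blocks.

    Hypothetical helper mirroring the corrected slicing in the diff.
    """
    index = np.arange(first_labeled, n_nodes)
    # permutation() returns a shuffled copy, so the result must be assigned.
    index = np.random.RandomState(seed).permutation(index)

    n = len(index)                 # sizes are relative to the labeled nodes
    n_train = int(train_size * n)
    n_val = int(val_size * n)

    train_idx = index[:n_train]
    val_idx = index[n - n_val:]    # n - n_val also behaves when n_val == 0
    test_idx = index[n_train:n - n_val]
    return train_idx, val_idx, test_idx


# Amazon-like setting from the diff: 11,944 nodes, IDs 0-3304 unlabeled.
train_idx, val_idx, test_idx = split_indices(
    11944, train_size=0.7, val_size=0.1, seed=0, first_labeled=3305
)
print(len(train_idx), len(val_idx), len(test_idx))
```

With the old boundaries and the same illustrative 0.7 / 0.1 fractions, the validation slice would have started at position `int(11944 - 0.1 * 11944)`, which is past the end of the 8,639-element labeled index, leaving the validation set empty.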
@@ -202,9 +202,9 @@ class FraudYelpDataset(FraudDataset):
     The Yelp dataset includes hotel and restaurant reviews filtered (spam) and recommended
     (legitimate) by Yelp. A spam review detection task can be conducted, which is a binary
-    classification task. 32 handcrafted features from
-    <http://dx.doi.org/10.1145/2783258.2783370> are taken as the raw node features. Reviews
-    are nodes in the graph, and three relations are:
+    classification task. 32 handcrafted features from <http://dx.doi.org/10.1145/2783258.2783370>
+    are taken as the raw node features. Reviews are nodes in the graph, and three relations are:
     1. R-U-R: it connects reviews posted by the same user
     2. R-S-R: it connects reviews under the same product with the same star rating (1-5 stars)
     3. R-T-R: it connects two reviews under the same product posted in the same month.
@@ -213,13 +213,16 @@ class FraudYelpDataset(FraudDataset):
     - Nodes: 45,954
     - Edges:
-        R-U-R: 49,315
-        R-T-R: 573,616
-        R-S-R: 3,402,743
-        ALL: 3,846,979
+        - R-U-R: 98,630
+        - R-T-R: 1,147,232
+        - R-S-R: 6,805,486
     - Classes:
-        Positive (spam): 6,677
-        Negative (legitimate): 39,277
+        - Positive (spam): 6,677
+        - Negative (legitimate): 39,277
     - Positive-Negative ratio: 1 : 5.9
     - Node feature size: 32
@@ -278,14 +281,18 @@ class FraudAmazonDataset(FraudDataset):
     - Nodes: 11,944
     - Edges:
-        U-P-U: 175,608
-        U-S-U: 3,566,479
-        U-V-U: 1,036,737
-        ALL: 4,398,392
+        - U-P-U: 351,216
+        - U-S-U: 7,132,958
+        - U-V-U: 2,073,474
     - Classes:
-        Positive (fraudulent): 821
-        Negative (benign): 11,123
-    - Positive-Negative ratio: 1 : 13.5
+        - Positive (fraudulent): 821
+        - Negative (benign): 7,818
+        - Unlabeled: 3,305
+    - Positive-Negative ratio: 1 : 10.5
     - Node feature size: 25
     Parameters
......
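The revised docstring statistics can be cross-checked with simple arithmetic. Each new per-relation edge count is exactly twice the old one, which would be consistent with counting every undirected edge once per direction; that reading is an assumption here, since the commit message only says the statistics were modified "to fit in dgl nature". The sketch below uses only numbers quoted in the diff.

```python
# Per-relation edge counts before and after the commit (from the docstrings).
old_edges = {
    "R-U-R": 49_315, "R-T-R": 573_616, "R-S-R": 3_402_743,    # Yelp
    "U-P-U": 175_608, "U-S-U": 3_566_479, "U-V-U": 1_036_737,  # Amazon
}
new_edges = {
    "R-U-R": 98_630, "R-T-R": 1_147_232, "R-S-R": 6_805_486,
    "U-P-U": 351_216, "U-S-U": 7_132_958, "U-V-U": 2_073_474,
}

# True for a relation when the new count is exactly double the old one.
doubled = {rel: new_edges[rel] == 2 * old_edges[rel] for rel in old_edges}

# Amazon class counts: positive + negative + unlabeled should equal the node count.
amazon_total = 821 + 7_818 + 3_305

# Negatives per positive, computed from the listed class counts.
yelp_ratio = 39_277 / 6_677
amazon_ratio = 7_818 / 821

print(doubled, amazon_total, round(yelp_ratio, 1), round(amazon_ratio, 1))
```

By this arithmetic the Yelp ratio matches the documented 1 : 5.9, while the labeled Amazon counts give roughly 1 : 9.5 rather than the 1 : 10.5 written in the docstring, so that figure may be worth rechecking.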