[BugFix] fix problems in data split (#3082)

* [BugFix] fix problems in data split
* fix format problems in docstring
* modify statistics to fit in dgl nature

Co-authored-by: Quan (Andy) Gan <coin2028@hotmail.com>
Co-authored-by: zhjwy9343 <6593865@qq.com>

Unverified Commit bb89dee7 authored Jul 07, 2021 by Kay Liu Committed by GitHub Jul 07, 2021

Showing with 65 additions and 48 deletions

python/dgl/data/fakenews.py +32 -22

python/dgl/data/fraud.py +33 -26

--- a/python/dgl/data/fakenews.py
+++ b/python/dgl/data/fakenews.py
@@ -19,15 +19,19 @@ class FakeNewsDataset(DGLBuiltinDataset):
    the root node represents the news, the leaf nodes are Twitter users
    who retweeted the root news. Besides, the node features are encoded
    user historical tweets using different pretrained language models:
-        bert: the 768-dimensional node feature composed of Twitter user
-              historical tweets encoded by the bert-as-service
-        content: the 310-dimensional node feature composed of a
-                 300-dimensional “spacy” vector plus a 10-dimensional
-                 “profile” vector
-        profile: the 10-dimensional node feature composed of ten Twitter
-                 user profile attributes.
-        spacy: the 300-dimensional node feature composed of Twitter user
-               historical tweets encoded by the spaCy word2vec encoder.
+    - bert: the 768-dimensional node feature composed of Twitter user
+    historical tweets encoded by the bert-as-service
+    - content: the 310-dimensional node feature composed of a
+    300-dimensional “spacy” vector plus a 10-dimensional
+    “profile” vector
+    - profile: the 10-dimensional node feature composed of ten Twitter
+    user profile attributes.
+    - spacy: the 300-dimensional node feature composed of Twitter user
+    historical tweets encoded by the spaCy word2vec encoder.
    Note: this dataset is for academic use only, and commercial use is prohibited.
@@ -39,27 +43,33 @@ class FakeNewsDataset(DGLBuiltinDataset):
        - Nodes: 41,054
        - Edges: 40,740
        - Classes:
-            Fake: 157
-            Real: 157
+            - Fake: 157
+            - Real: 157
        - Node feature size:
-            bert: 768
-            content: 310
-            profile: 10
-            spacy: 300
+            - bert: 768
+            - content: 310
+            - profile: 10
+            - spacy: 300
        Gossipcop:
-        - Graphs: 5464
+        - Graphs: 5,464
        - Nodes: 314,262
        - Edges: 308,798
        - Classes:
-            Fake: 2732
-            Real: 2732
+            - Fake: 2,732
+            - Real: 2,732
        - Node feature size:
-            bert: 768
-            content: 310
-            profile: 10
-            spacy: 300
+            - bert: 768
+            - content: 310
+            - profile: 10
+            - spacy: 300
    Parameters
    ----------

--- a/python/dgl/data/fraud.py
+++ b/python/dgl/data/fraud.py
@@ -177,15 +177,15 @@ class FraudDataset(DGLBuiltinDataset):
            "must between 0 and 1 (inclusive)."
        N = x.shape[0]
-        index = list(range(N))
+        index = np.arange(N)
        if self.name == 'amazon':
            # 0-3304 are unlabeled nodes
-            index = list(range(3305, N))
+            index = np.arange(3305, N)
-        np.random.RandomState(seed).permutation(index)
+        index = np.random.RandomState(seed).permutation(index)
-        train_idx = index[:int(train_size * N)]
+        train_idx = index[:int(train_size * len(index))]
-        val_idx = index[int(N - val_size * N):]
+        val_idx = index[len(index) - int(val_size * len(index)):]
-        test_idx = index[int(train_size * N):int(N - val_size * N)]
+        test_idx = index[int(train_size * len(index)):len(index) - int(val_size * len(index))]
        train_mask = np.zeros(N, dtype=np.bool)
        val_mask = np.zeros(N, dtype=np.bool)
        test_mask = np.zeros(N, dtype=np.bool)
@@ -202,9 +202,9 @@ class FraudYelpDataset(FraudDataset):
    The Yelp dataset includes hotel and restaurant reviews filtered (spam) and recommended
    (legitimate) by Yelp. A spam review detection task can be conducted, which is a binary
-    classification task. 32 handcrafted features from
-    <http://dx.doi.org/10.1145/2783258.2783370> are taken as the raw node features. Reviews
-    are nodes in the graph, and three relations are:
+    classification task. 32 handcrafted features from <http://dx.doi.org/10.1145/2783258.2783370>
+    are taken as the raw node features. Reviews are nodes in the graph, and three relations are:
        1. R-U-R: it connects reviews posted by the same user
        2. R-S-R: it connects reviews under the same product with the same star rating (1-5 stars)
        3. R-T-R: it connects two reviews under the same product posted in the same month.
@@ -213,13 +213,16 @@ class FraudYelpDataset(FraudDataset):
    - Nodes: 45,954
    - Edges:
-        R-U-R: 49,315
-        R-T-R: 573,616
-        R-S-R: 3,402,743
-        ALL: 3,846,979
+        - R-U-R: 98,630
+        - R-T-R: 1,147,232
+        - R-S-R: 6,805,486
    - Classes:
-        Positive (spam): 6,677
-        Negative (legitimate): 39,277
+        - Positive (spam): 6,677
+        - Negative (legitimate): 39,277
    - Positive-Negative ratio: 1 : 5.9
    - Node feature size: 32
@@ -269,23 +272,27 @@ class FraudAmazonDataset(FraudDataset):
    the raw node features .
    Users are nodes in the graph, and three relations are:
-        1. U-P-U : it connects users reviewing at least one same product
+    1. U-P-U : it connects users reviewing at least one same product
-        2. U-S-U : it connects users having at least one same star rating within one week
+    2. U-S-U : it connects users having at least one same star rating within one week
-        3. U-V-U : it connects users with top 5% mutual review text similarities (measured by
+    3. U-V-U : it connects users with top 5% mutual review text similarities (measured by
-                   TF-IDF) among all users.
+    TF-IDF) among all users.
    Statistics:
    - Nodes: 11,944
    - Edges:
-        U-P-U: 175,608
-        U-S-U: 3,566,479
-        U-V-U: 1,036,737
-        ALL: 4,398,392
+        - U-P-U: 351,216
+        - U-S-U: 7,132,958
+        - U-V-U: 2,073,474
    - Classes:
-        Positive (fraudulent): 821
-        Negative (benign): 11,123
-    - Positive-Negative ratio: 1 : 13.5
+        - Positive (fraudulent): 821
+        - Negative (benign): 7,818
+        - Unlabeled: 3,305
+    - Positive-Negative ratio: 1 : 10.5
    - Node feature size: 25
    Parameters
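
Note on the `fraud.py` split hunk: the change appears to address three separate problems. `RandomState.permutation` returns a shuffled copy rather than shuffling in place, so the old code discarded the result and the split was never randomized. The slice bounds were computed from `N` (all nodes) rather than from `len(index)`, which is smaller for `amazon` once the unlabeled nodes 0-3304 are dropped. And the index is now a NumPy array instead of a list. A minimal standalone sketch of the corrected logic follows; the function name `split_indices`, the `first_labeled` parameter, and the default values are hypothetical and used only for illustration, not taken from DGL.

```python
import numpy as np


def split_indices(num_nodes, train_size=0.7, val_size=0.1, seed=717,
                  first_labeled=0):
    """Return shuffled (train, val, test) index arrays over labeled nodes.

    Hypothetical helper mirroring the fixed logic in the diff; the name,
    `first_labeled`, and the defaults are illustrative assumptions.
    """
    assert 0 <= train_size + val_size <= 1, "sizes must sum to at most 1"

    # Only nodes from `first_labeled` onward take part in the split
    # (3305 for the amazon dataset, where nodes 0-3304 are unlabeled).
    index = np.arange(first_labeled, num_nodes)

    # permutation() returns a new shuffled array, so the result must be
    # assigned back; calling it without assignment leaves `index` sorted.
    index = np.random.RandomState(seed).permutation(index)

    # Sizes are fractions of the labeled nodes, not of all nodes.
    n = len(index)
    n_train = int(train_size * n)
    n_val = int(val_size * n)

    train_idx = index[:n_train]
    val_idx = index[n - n_val:]
    test_idx = index[n_train:n - n_val]
    return train_idx, val_idx, test_idx


# Example with the amazon node counts quoted in the docstring above.
train_idx, val_idx, test_idx = split_indices(11944, first_labeled=3305)
print(len(train_idx), len(val_idx), len(test_idx))
```

Writing the validation slice as `index[n - n_val:]`, as the diff does, also behaves correctly when the validation size rounds down to zero, whereas a negative-index form such as `index[-n_val:]` would return the whole array in that case.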