[BugFix] fix problems in data split (#3082)

* [BugFix] fix problems in data split
* fix format problems in docstring
* modify statistics to fit in dgl nature

Co-authored-by: Quan (Andy) Gan <coin2028@hotmail.com>
Co-authored-by: zhjwy9343 <6593865@qq.com>

Commit bb89dee7 authored Jul 07, 2021 by Kay Liu; committed by GitHub Jul 07, 2021

Showing with 65 additions and 48 deletions

python/dgl/data/fakenews.py +32 -22

python/dgl/data/fraud.py +33 -26

--- a/python/dgl/data/fakenews.py
+++ b/python/dgl/data/fakenews.py
@@ -19,14 +19,18 @@ class FakeNewsDataset(DGLBuiltinDataset):
    the root node represents the news, the leaf nodes are Twitter users
    who retweeted the root news. Besides, the node features are encoded
    user historical tweets using different pretrained language models:
-        bert: the 768-dimensional node feature composed of Twitter user
+
+    - bert: the 768-dimensional node feature composed of Twitter user
    historical tweets encoded by the bert-as-service
-        content: the 310-dimensional node feature composed of a
+
+    - content: the 310-dimensional node feature composed of a
    300-dimensional “spacy” vector plus a 10-dimensional
    “profile” vector
-        profile: the 10-dimensional node feature composed of ten Twitter
+
+    - profile: the 10-dimensional node feature composed of ten Twitter
    user profile attributes.
-        spacy: the 300-dimensional node feature composed of Twitter user
+
+    - spacy: the 300-dimensional node feature composed of Twitter user
    historical tweets encoded by the spaCy word2vec encoder.

    Note: this dataset is for academic use only, and commercial use is prohibited.
@@ -39,27 +43,33 @@ class FakeNewsDataset(DGLBuiltinDataset):
        - Nodes: 41,054
        - Edges: 40,740
        - Classes:
-            Fake: 157
-            Real: 157
+
+            - Fake: 157
+            - Real: 157
+
        - Node feature size:
-            bert: 768
-            content: 310
-            profile: 10
-            spacy: 300
+
+            - bert: 768
+            - content: 310
+            - profile: 10
+            - spacy: 300

        Gossipcop:

-        - Graphs: 5464
+        - Graphs: 5,464
        - Nodes: 314,262
        - Edges: 308,798
        - Classes:
-            Fake: 2732
-            Real: 2732
+
+            - Fake: 2,732
+            - Real: 2,732
+
        - Node feature size:
-            bert: 768
-            content: 310
-            profile: 10
-            spacy: 300
+
+            - bert: 768
+            - content: 310
+            - profile: 10
+            - spacy: 300

    Parameters
    ----------

--- a/python/dgl/data/fraud.py
+++ b/python/dgl/data/fraud.py
@@ -177,15 +177,15 @@ class FraudDataset(DGLBuiltinDataset):
            "must between 0 and 1 (inclusive)."

        N = x.shape[0]
-        index = list(range(N))
+        index = np.arange(N)
        if self.name == 'amazon':
            # 0-3304 are unlabeled nodes
-            index = list(range(3305, N))
+            index = np.arange(3305, N)

-        np.random.RandomState(seed).permutation(index)
-        train_idx = index[:int(train_size * N)]
-        val_idx = index[int(N - val_size * N):]
-        test_idx = index[int(train_size * N):int(N - val_size * N)]
+        index = np.random.RandomState(seed).permutation(index)
+        train_idx = index[:int(train_size * len(index))]
+        val_idx = index[len(index) - int(val_size * len(index)):]
+        test_idx = index[int(train_size * len(index)):len(index) - int(val_size * len(index))]
        train_mask = np.zeros(N, dtype=np.bool)
        val_mask = np.zeros(N, dtype=np.bool)
        test_mask = np.zeros(N, dtype=np.bool)
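
The hunk above changes two things: the result of `permutation` is now assigned back to `index` (the call returns a shuffled copy rather than shuffling in place), and the split sizes are computed from `len(index)` rather than `N`, so that skipped unlabeled nodes no longer distort the slices. A minimal standalone sketch of the corrected logic follows. The function name `split_indices`, its `first_labeled` parameter, and the sample values for seed and split fractions are hypothetical and chosen for illustration; only the offset 3305 and the slicing arithmetic come from the diff.

```python
import numpy as np

def split_indices(n_nodes, train_size, val_size, seed, first_labeled=0):
    """Return disjoint train/val/test index arrays over the labeled nodes.

    Hypothetical helper mirroring the corrected slicing in the diff;
    it is not part of the DGL API.
    """
    # Candidate nodes start at first_labeled (3305 in the 'amazon' branch).
    index = np.arange(first_labeled, n_nodes)
    # permutation() returns a shuffled copy, so the result must be kept.
    index = np.random.RandomState(seed).permutation(index)
    n = len(index)  # fractions apply to the candidate nodes, not to n_nodes
    n_train = int(train_size * n)
    n_val = int(val_size * n)
    train_idx = index[:n_train]
    val_idx = index[n - n_val:]
    test_idx = index[n_train:n - n_val]
    return train_idx, val_idx, test_idx

# Illustrative values: 11,944 nodes with the first 3,305 unlabeled.
train_idx, val_idx, test_idx = split_indices(
    11944, train_size=0.7, val_size=0.1, seed=717, first_labeled=3305
)
print(len(train_idx), len(val_idx), len(test_idx))  # 6047 863 1729
```

With the old arithmetic, `int(train_size * N)` would have been 8360 against only 8639 candidate indices, leaving the test slice `index[8360:10750]` with 279 entries and the validation slice `index[10750:]` empty.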
@@ -202,9 +202,9 @@ class FraudYelpDataset(FraudDataset):

    The Yelp dataset includes hotel and restaurant reviews filtered (spam) and recommended
    (legitimate) by Yelp. A spam review detection task can be conducted, which is a binary
-    classification task. 32 handcrafted features from
-    <http://dx.doi.org/10.1145/2783258.2783370> are taken as the raw node features. Reviews
-    are nodes in the graph, and three relations are:
+    classification task. 32 handcrafted features from <http://dx.doi.org/10.1145/2783258.2783370>
+    are taken as the raw node features. Reviews are nodes in the graph, and three relations are:
+
        1. R-U-R: it connects reviews posted by the same user
        2. R-S-R: it connects reviews under the same product with the same star rating (1-5 stars)
        3. R-T-R: it connects two reviews under the same product posted in the same month.
@@ -213,13 +213,16 @@ class FraudYelpDataset(FraudDataset):

    - Nodes: 45,954
    - Edges:
-        R-U-R: 49,315
-        R-T-R: 573,616
-        R-S-R: 3,402,743
-        ALL: 3,846,979
+
+        - R-U-R: 98,630
+        - R-T-R: 1,147,232
+        - R-S-R: 6,805,486
+
    - Classes:
-        Positive (spam): 6,677
-        Negative (legitimate): 39,277
+
+        - Positive (spam): 6,677
+        - Negative (legitimate): 39,277
+
    - Positive-Negative ratio: 1 : 5.9
    - Node feature size: 32

@@ -278,14 +281,18 @@ class FraudAmazonDataset(FraudDataset):

    - Nodes: 11,944
    - Edges:
-        U-P-U: 175,608
-        U-S-U: 3,566,479
-        U-V-U: 1,036,737
-        ALL: 4,398,392
+
+        - U-P-U: 351,216
+        - U-S-U: 7,132,958
+        - U-V-U: 2,073,474
+
    - Classes:
-        Positive (fraudulent): 821
-        Negative (benign): 11,123
-    - Positive-Negative ratio: 1 : 13.5
+
+        - Positive (fraudulent): 821
+        - Negative (benign): 7,818
+        - Unlabeled: 3,305
+
+    - Positive-Negative ratio: 1 : 10.5
    - Node feature size: 25

    Parameters
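
The revised statistics in the two `fraud.py` docstrings can be cross-checked with plain arithmetic. The sketch below uses only numbers that appear in the diff. Every new per-relation edge count is exactly twice the old one, which is consistent with counting each undirected edge once per direction; reading "fit in dgl nature" that way is an assumption, since the commit message does not spell it out.

```python
# Old and new per-relation edge counts, copied from the docstring hunks above.
old_edges = {
    "R-U-R": 49_315, "R-T-R": 573_616, "R-S-R": 3_402_743,    # Yelp
    "U-P-U": 175_608, "U-S-U": 3_566_479, "U-V-U": 1_036_737,  # Amazon
}
new_edges = {
    "R-U-R": 98_630, "R-T-R": 1_147_232, "R-S-R": 6_805_486,
    "U-P-U": 351_216, "U-S-U": 7_132_958, "U-V-U": 2_073_474,
}
doubled = all(new_edges[rel] == 2 * old_edges[rel] for rel in old_edges)

# Class counts should add up to the node counts stated in the docstrings.
yelp_total = 6_677 + 39_277            # positive + negative
amazon_total = 821 + 7_818 + 3_305     # positive + negative + unlabeled

# Negatives per positive, rounded to one decimal place.
yelp_ratio = round(39_277 / 6_677, 1)
amazon_ratio = round(7_818 / 821, 1)

print(doubled, yelp_total, amazon_total, yelp_ratio, amazon_ratio)
# True 45954 11944 5.9 9.5
```

The class counts sum to the stated node totals for both datasets, and the Yelp ratio matches the docstring's 1 : 5.9. For Amazon, 7,818 / 821 is about 9.5, so the 1 : 10.5 figure in the new docstring does not follow from the labeled class counts listed beside it.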