Unverified Commit 539335ce authored by Jinjing Zhou, committed by GitHub

DGL Enter (#3690)



* add

* fix

* fix

* fix

* fix

* add

* add

* fix

* fix

* fix

* new loader

* fix

* fix

* fix for 3.6

* fix

* add

* add recipes and also some bug fixes

* fix

* fix

* fix

* fix recipes

* allow AsNodeDataset to work on ogb

* add ut

* many fixes for nodepred-ns pipeline

* recipe for nodepred-ns

* Update enter/README.md
Co-authored-by: Zihao Ye <zihaoye.cs@gmail.com>

* fix layers

* fix

* fix

* fix

* fix

* fix multiple issues

* fix for citation2

* fix comment

* fix

* fix

* clean up

* fix
Co-authored-by: Minjie Wang <wmjlyjemaine@gmail.com>
Co-authored-by: Minjie Wang <minjie.wang@nyu.edu>
Co-authored-by: Zihao Ye <zihaoye.cs@gmail.com>
parent 80fb4dbe
# Accuracy across 10 runs: 0.7097 ± 0.006914
version: 0.0.1
pipeline_name: nodepred
device: cuda:0
data:
  name: citeseer
model:
  name: gat
  embed_size: -1 # The dimension of created embedding table. -1 means using original node embedding
  num_layers: 2 # Number of layers.
  hidden_size: 8 # Hidden size.
  heads:
    - 8
    - 1
  activation: elu # Activation function.
  feat_drop: 0.6 # Dropout rate for features.
  attn_drop: 0.6 # Dropout rate for attentions.
  negative_slope: 0.2
  residual: false # If true, the GATConv will use a residual connection
general_pipeline:
  early_stop:
    patience: 100 # Steps before early stop
    checkpoint_path: checkpoint.pth # Early stop checkpoint model file path
  num_epochs: 200 # Number of training epochs
  eval_period: 5 # Interval epochs between evaluations
  optimizer:
    name: Adam
    lr: 0.005
    weight_decay: 0.0005
  loss: CrossEntropyLoss
  num_runs: 10 # Number of experiments to run
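The recipes in this commit are plain YAML, so they are easy to inspect programmatically. Below is a minimal sketch (not part of the commit) that parses the citeseer/gat recipe above with ruamel.yaml, which is declared as a dependency in the setup.py further down; the filename is hypothetical, assuming the recipe is saved locally.

# Minimal sketch: parse a recipe and read a few fields.
# Assumes the citeseer/gat recipe above is saved as
# nodepred_citeseer_gat.yaml (hypothetical filename).
from ruamel.yaml import YAML

yaml = YAML(typ="safe")
with open("nodepred_citeseer_gat.yaml") as f:
    cfg = yaml.load(f)

print(cfg["pipeline_name"])                         # nodepred
print(cfg["model"]["name"], cfg["model"]["heads"])  # gat [8, 1]
print(cfg["general_pipeline"]["optimizer"]["lr"])   # 0.005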
# Accuracy across 10 runs: 0.6852 ± 0.008875
version: 0.0.1
pipeline_name: nodepred
device: cuda:0
data:
  name: citeseer
model:
  name: gcn
  embed_size: -1 # The dimension of created embedding table. -1 means using original node embedding
  hidden_size: 16 # Hidden size.
  num_layers: 2 # Number of layers.
  norm: both # GCN normalization type. Can be 'both', 'right', 'left', 'none'.
  activation: relu # Activation function.
  dropout: 0.5 # Dropout rate.
  use_edge_weight: false # If true, scale the messages by edge weights.
general_pipeline:
  early_stop:
    patience: 100 # Steps before early stop
    checkpoint_path: checkpoint.pth # Early stop checkpoint model file path
  num_epochs: 200 # Number of training epochs
  eval_period: 5 # Interval epochs between evaluations
  optimizer:
    name: Adam
    lr: 0.01
    weight_decay: 0.0005
  loss: CrossEntropyLoss
  num_runs: 10 # Number of experiments to run
# Accuracy across 10 runs: 0.6994 ± 0.004005
version: 0.0.1
pipeline_name: nodepred
device: cuda:0
data:
  name: citeseer
model:
  name: sage
  embed_size: -1 # The dimension of created embedding table. -1 means using original node embedding
  hidden_size: 16 # Hidden size.
  num_layers: 2 # Number of layers.
  activation: relu
  dropout: 0.5 # Dropout rate.
  aggregator_type: gcn # Aggregator type to use (``mean``, ``gcn``, ``pool``, ``lstm``).
general_pipeline:
  early_stop:
    patience: 100 # Steps before early stop
    checkpoint_path: checkpoint.pth # Early stop checkpoint model file path
  num_epochs: 200 # Number of training epochs
  eval_period: 5 # Interval epochs between evaluations
  optimizer:
    name: Adam
    lr: 0.01
    weight_decay: 0.0005
  loss: CrossEntropyLoss
  num_runs: 10 # Number of experiments to run
# Accuracy across 10 runs: 0.8208 ± 0.00663
version: 0.0.1
pipeline_name: nodepred
device: cuda:0
data:
  name: cora
model:
  name: gat
  embed_size: -1 # The dimension of created embedding table. -1 means using original node embedding
  num_layers: 2 # Number of layers.
  hidden_size: 8 # Hidden size.
  heads:
    - 8
    - 1
  activation: elu # Activation function.
  feat_drop: 0.6 # Dropout rate for features.
  attn_drop: 0.6 # Dropout rate for attentions.
  negative_slope: 0.2
  residual: false # If true, the GATConv will use a residual connection
general_pipeline:
  early_stop:
    patience: 100 # Steps before early stop
    checkpoint_path: checkpoint.pth # Early stop checkpoint model file path
  num_epochs: 200 # Number of training epochs
  eval_period: 5 # Interval epochs between evaluations
  optimizer:
    name: Adam
    lr: 0.005
    weight_decay: 0.0005
  loss: CrossEntropyLoss
  num_runs: 10 # Number of experiments to run
# Accuracy across 10 runs: 0.802 ± 0.005329
version: 0.0.1
pipeline_name: nodepred
device: cuda:0
data:
  name: cora
model:
  name: gcn
  embed_size: -1 # The dimension of created embedding table. -1 means using original node embedding
  hidden_size: 16 # Hidden size.
  num_layers: 2 # Number of layers.
  norm: both # GCN normalization type. Can be 'both', 'right', 'left', 'none'.
  activation: relu # Activation function.
  dropout: 0.5 # Dropout rate.
  use_edge_weight: false # If true, scale the messages by edge weights.
general_pipeline:
  early_stop:
    patience: 100 # Steps before early stop
    checkpoint_path: checkpoint.pth # Early stop checkpoint model file path
  num_epochs: 200 # Number of training epochs
  eval_period: 5 # Interval epochs between evaluations
  optimizer:
    name: Adam
    lr: 0.01
    weight_decay: 0.0005
  loss: CrossEntropyLoss
  num_runs: 10 # Number of experiments to run
# Accuracy across 10 runs: 0.8163 ± 0.006856
version: 0.0.1
pipeline_name: nodepred
device: cuda:0
data:
  name: cora
model:
  name: sage
  embed_size: -1 # The dimension of created embedding table. -1 means using original node embedding
  hidden_size: 16 # Hidden size.
  num_layers: 2 # Number of layers.
  activation: relu
  dropout: 0.5 # Dropout rate.
  aggregator_type: gcn # Aggregator type to use (``mean``, ``gcn``, ``pool``, ``lstm``).
general_pipeline:
  early_stop:
    patience: 100 # Steps before early stop
    checkpoint_path: checkpoint.pth # Early stop checkpoint model file path
  num_epochs: 200 # Number of training epochs
  eval_period: 5 # Interval epochs between evaluations
  optimizer:
    name: Adam
    lr: 0.01
    weight_decay: 0.0005
  loss: CrossEntropyLoss
  num_runs: 10 # Number of experiments to run
# Accuracy across 10 runs: 0.7788 ± 0.002227
version: 0.0.1
pipeline_name: nodepred
device: cuda:0
data:
  name: pubmed
model:
  name: gat
  embed_size: -1 # The dimension of created embedding table. -1 means using original node embedding
  num_layers: 2 # Number of layers.
  hidden_size: 8 # Hidden size.
  heads:
    - 8
    - 8
  activation: elu # Activation function.
  feat_drop: 0.6 # Dropout rate for features.
  attn_drop: 0.6 # Dropout rate for attentions.
  negative_slope: 0.2
  residual: false # If true, the GATConv will use a residual connection
general_pipeline:
  early_stop:
    patience: 100 # Steps before early stop
    checkpoint_path: checkpoint.pth # Early stop checkpoint model file path
  num_epochs: 200 # Number of training epochs
  eval_period: 5 # Interval epochs between evaluations
  optimizer:
    name: Adam
    lr: 0.005
    weight_decay: 0.001
  loss: CrossEntropyLoss
  num_runs: 10 # Number of experiments to run
# Accuracy across 10 runs: 0.7826 ± 0.004317
version: 0.0.1
pipeline_name: nodepred
device: cuda:0
data:
  name: pubmed
model:
  name: gcn
  embed_size: -1 # The dimension of created embedding table. -1 means using original node embedding
  hidden_size: 16 # Hidden size.
  num_layers: 2 # Number of layers.
  norm: both # GCN normalization type. Can be 'both', 'right', 'left', 'none'.
  activation: relu # Activation function.
  dropout: 0.5 # Dropout rate.
  use_edge_weight: false # If true, scale the messages by edge weights.
general_pipeline:
  early_stop:
    patience: 100 # Steps before early stop
    checkpoint_path: checkpoint.pth # Early stop checkpoint model file path
  num_epochs: 200 # Number of training epochs
  eval_period: 5 # Interval epochs between evaluations
  optimizer:
    name: Adam
    lr: 0.01
    weight_decay: 0.0005
  loss: CrossEntropyLoss
  num_runs: 10 # Number of experiments to run
# Accuracy across 10 runs: 0.7819 ± 0.003176
version: 0.0.1
pipeline_name: nodepred
device: cuda:0
data:
  name: pubmed
model:
  name: sage
  embed_size: -1 # The dimension of created embedding table. -1 means using original node embedding
  hidden_size: 16 # Hidden size.
  num_layers: 2 # Number of layers.
  activation: relu
  dropout: 0.5 # Dropout rate.
  aggregator_type: gcn # Aggregator type to use (``mean``, ``gcn``, ``pool``, ``lstm``).
general_pipeline:
  early_stop:
    patience: 100 # Steps before early stop
    checkpoint_path: checkpoint.pth # Early stop checkpoint model file path
  num_epochs: 200 # Number of training epochs
  eval_period: 5 # Interval epochs between evaluations
  optimizer:
    name: Adam
    lr: 0.01
    weight_decay: 0.0005
  loss: CrossEntropyLoss
  num_runs: 10 # Number of experiments to run
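Each recipe header above reports accuracy as mean ± standard deviation over num_runs experiments. A small sketch of how such a summary line can be produced; the per-run accuracy values below are made up for illustration and are not results from this commit.

import numpy as np

# Made-up per-run accuracies standing in for num_runs = 10 real evaluations.
accs = np.array([0.712, 0.705, 0.718, 0.709, 0.702,
                 0.715, 0.708, 0.711, 0.704, 0.713])
print(f"# Accuracy across {len(accs)} runs: {accs.mean():.4f} ± {accs.std():.6f}")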
#!/usr/bin/env python
from setuptools import find_packages
from distutils.core import setup

setup(name='dglenter',
      version='0.0.1',
      description='DGL',
      author='DGL Team',
      author_email='wmjlyjemaine@gmail.com',
      packages=find_packages(),
      install_requires=[
          'typer>=0.4.0',
          'isort>=5.10.1',
          'autopep8>=1.6.0',
          'numpydoc>=1.1.0',
          'pydantic>=1.9.0',
          'ruamel.yaml>=0.17.20'
      ],
      license='APACHE',
      entry_points={
          'console_scripts': [
              'dgl-enter = dglenter.cli.cli:main'
          ]
      },
      url='https://github.com/dmlc/dgl',
      )
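After `pip install .`, setuptools generates a `dgl-enter` console script from the entry point above, dispatching to `dglenter.cli.cli:main`. A sketch for verifying that wiring from an installed environment; it assumes Python 3.10+ for the keyword form of entry_points.

# Sketch: confirm the dgl-enter console script is registered after
# installing the package (assumes Python 3.10+ entry_points API).
from importlib.metadata import entry_points

for ep in entry_points(group="console_scripts"):
    if ep.name == "dgl-enter":
        print(ep.name, "->", ep.value)  # dgl-enter -> dglenter.cli.cli:main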
@@ -72,29 +72,49 @@ class AsNodePredDataset(DGLDataset):
                  split_ratio=None,
                  target_ntype=None,
                  **kwargs):
-        self.g = dataset[0].clone()
         self.dataset = dataset
         self.split_ratio = split_ratio
         self.target_ntype = target_ntype
-        self.num_classes = getattr(dataset, 'num_classes', None)
-        super().__init__(dataset.name + '-as-nodepred',
+        super().__init__(self.dataset.name + '-as-nodepred',
                          hash_key=(split_ratio, target_ntype), **kwargs)

     def process(self):
+        is_ogb = hasattr(self.dataset, 'get_idx_split')
+        if is_ogb:
+            g, label = self.dataset[0]
+            self.g = g.clone()
+            self.g.ndata['label'] = F.reshape(label, (g.num_nodes(),))
+        else:
+            self.g = self.dataset[0].clone()
         if 'label' not in self.g.nodes[self.target_ntype].data:
             raise ValueError("Missing node labels. Make sure labels are stored "
                              "under name 'label'.")
         if self.split_ratio is None:
-            assert "train_mask" in self.g.nodes[self.target_ntype].data, \
-                "train_mask is not provided, please specify split_ratio to generate the masks"
-            assert "val_mask" in self.g.nodes[self.target_ntype].data, \
-                "val_mask is not provided, please specify split_ratio to generate the masks"
-            assert "test_mask" in self.g.nodes[self.target_ntype].data, \
-                "test_mask is not provided, please specify split_ratio to generate the masks"
+            if is_ogb:
+                split = self.dataset.get_idx_split()
+                train_idx, val_idx, test_idx = split['train'], split['valid'], split['test']
+                n = self.g.num_nodes()
+                train_mask = utils.generate_mask_tensor(utils.idx2mask(train_idx, n))
+                val_mask = utils.generate_mask_tensor(utils.idx2mask(val_idx, n))
+                test_mask = utils.generate_mask_tensor(utils.idx2mask(test_idx, n))
+                self.g.ndata['train_mask'] = train_mask
+                self.g.ndata['val_mask'] = val_mask
+                self.g.ndata['test_mask'] = test_mask
+            else:
+                assert "train_mask" in self.g.nodes[self.target_ntype].data, \
+                    "train_mask is not provided, please specify split_ratio to generate the masks"
+                assert "val_mask" in self.g.nodes[self.target_ntype].data, \
+                    "val_mask is not provided, please specify split_ratio to generate the masks"
+                assert "test_mask" in self.g.nodes[self.target_ntype].data, \
+                    "test_mask is not provided, please specify split_ratio to generate the masks"
         else:
             if self.verbose:
                 print('Generating train/val/test masks...')
             utils.add_nodepred_split(self, self.split_ratio, self.target_ntype)
+        self.num_classes = getattr(self.dataset, 'num_classes', None)
+        if self.num_classes is None:
+            self.num_classes = len(F.unique(self.g.nodes[self.target_ntype].data['label']))
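The new OGB branch above is exercised by the unit test at the bottom of this commit; as a standalone sketch (assuming the `ogb` package and a PyTorch backend, per the test's skip condition):

# Sketch mirroring test_as_nodepred_ogb below: wrapping an OGB dataset
# yields a graph carrying train/val/test masks built from get_idx_split(),
# with num_classes taken from the dataset or inferred from the labels.
from ogb.nodeproppred import DglNodePropPredDataset
from dgl.data import AsNodePredDataset

ds = AsNodePredDataset(DglNodePropPredDataset("ogbn-arxiv"))
g = ds[0]
print(ds.num_classes)                      # 40 classes for ogbn-arxiv
print(g.ndata["train_mask"].sum().item())  # size of the OGB train split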
@@ -169,7 +189,7 @@ class AsLinkPredDataset(DGLDataset):
         Split ratios for training, validation and test sets. Must sum to one.
     neg_ratio : int, optional
         Indicate how many negative samples to be sampled
-        The number of the negative samples will be neg_ratio * num_positive_edges.
+        The number of the negative samples will be equal to or less than neg_ratio * num_positive_edges.

     Attributes
     -------
@@ -211,24 +231,42 @@ class AsLinkPredDataset(DGLDataset):
     def process(self):
         if self.split_ratio is None:
             # Handle logics for OGB link prediction dataset
             assert hasattr(self.dataset, "get_edge_split"), \
                 "dataset doesn't have get_edge_split method, please specify split_ratio and neg_ratio to generate the split"
             # This is likely to be an ogb dataset
             self.edge_split = self.dataset.get_edge_split()
             self._train_graph = self.g
-            pos_e_tensor, neg_e_tensor = self.edge_split["valid"][
-                "edge"], self.edge_split["valid"]["edge_neg"]
-            pos_e = (pos_e_tensor[:, 0], pos_e_tensor[:, 1])
-            neg_e = (neg_e_tensor[:, 0], neg_e_tensor[:, 1])
-            self._val_edges = pos_e, neg_e
-            pos_e_tensor, neg_e_tensor = self.edge_split["test"][
-                "edge"], self.edge_split["test"]["edge_neg"]
-            pos_e = (pos_e_tensor[:, 0], pos_e_tensor[:, 1])
-            neg_e = (neg_e_tensor[:, 0], neg_e_tensor[:, 1])
-            self._test_edges = pos_e, neg_e
+            if 'source_node' in self.edge_split["test"]:
+                # Probably ogbl-citation2
+                pos_e = (self.edge_split["valid"]["source_node"], self.edge_split["valid"]["target_node"])
+                neg_e_size = self.edge_split["valid"]['target_node_neg'].shape[-1]
+                neg_e_src = np.repeat(self.edge_split['valid']['source_node'], neg_e_size)
+                neg_e_dst = np.reshape(self.edge_split["valid"]["target_node_neg"], -1)
+                self._val_edges = pos_e, (neg_e_src, neg_e_dst)
+                pos_e = (self.edge_split["test"]["source_node"], self.edge_split["test"]["target_node"])
+                neg_e_size = self.edge_split["test"]['target_node_neg'].shape[-1]
+                neg_e_src = np.repeat(self.edge_split['test']['source_node'], neg_e_size)
+                neg_e_dst = np.reshape(self.edge_split["test"]["target_node_neg"], -1)
+                self._test_edges = pos_e, (neg_e_src, neg_e_dst)
+            elif 'edge' in self.edge_split["test"]:
+                # Probably ogbl-collab
+                pos_e_tensor, neg_e_tensor = self.edge_split["valid"][
+                    "edge"], self.edge_split["valid"]["edge_neg"]
+                pos_e = (pos_e_tensor[:, 0], pos_e_tensor[:, 1])
+                neg_e = (neg_e_tensor[:, 0], neg_e_tensor[:, 1])
+                self._val_edges = pos_e, neg_e
+                pos_e_tensor, neg_e_tensor = self.edge_split["test"][
+                    "edge"], self.edge_split["test"]["edge_neg"]
+                pos_e = (pos_e_tensor[:, 0], pos_e_tensor[:, 1])
+                neg_e = (neg_e_tensor[:, 0], neg_e_tensor[:, 1])
+                self._test_edges = pos_e, neg_e
+            # delete edge split to save memory
+            self.edge_split = None
         else:
             assert self.split_ratio is not None, "Need to specify split_ratio"
             assert self.neg_ratio is not None, "Need to specify neg_ratio"
             ratio = self.split_ratio
             graph = self.dataset[0]
             n = graph.num_edges()
......
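The ogbl-citation2 branch above flattens a per-source matrix of negative targets into flat (src, dst) pairs via np.repeat and np.reshape. A toy, self-contained illustration of that step (the arrays are made up, not real OGB data):

import numpy as np

# Toy stand-ins for the citation2 split: 3 positive source nodes, each
# paired with K = 2 sampled negative target nodes (as in target_node_neg).
source_node = np.array([0, 1, 2])
target_node_neg = np.array([[5, 6], [7, 8], [9, 10]])  # shape (E, K)

K = target_node_neg.shape[-1]
neg_src = np.repeat(source_node, K)        # [0 0 1 1 2 2]
neg_dst = np.reshape(target_node_neg, -1)  # [5 6 7 8 9 10]
print(list(zip(neg_src, neg_dst)))         # six (src, dst) negative pairs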
@@ -40,7 +40,7 @@ bool TensorDispatcher::Load(const char *path) {
   handle_ = dlopen(path, RTLD_LAZY);
   if (!handle_) {
-    LOG(WARNING) << "TensorDispatcher: dlopen failed: " << dlerror();
+    DLOG(WARNING) << "TensorDispatcher: dlopen failed: " << dlerror();
     return false;
   }
......
@@ -1154,7 +1154,12 @@ def test_as_nodepred2():
     ds = data.AsNodePredDataset(data.AIFBDataset(), [0.1, 0.1, 0.8], 'Personen', verbose=True)
     assert F.sum(F.astype(ds[0].nodes['Personen'].data['train_mask'], F.int32), 0) == int(ds[0].num_nodes('Personen') * 0.1)

+@unittest.skipIf(dgl.backend.backend_name != 'pytorch', reason="ogb only supports pytorch")
+def test_as_nodepred_ogb():
+    from ogb.nodeproppred import DglNodePropPredDataset
+    ds = data.AsNodePredDataset(DglNodePropPredDataset("ogbn-arxiv"), split_ratio=None, verbose=True)
+    # force generate new split
+    ds = data.AsNodePredDataset(DglNodePropPredDataset("ogbn-arxiv"), split_ratio=[0.7, 0.2, 0.1], verbose=True)
+
 @unittest.skipIf(F._default_context_str == 'gpu', reason="Datasets don't need to be tested on GPU.")
 def test_as_linkpred():
......