Unverified commit 28117cd9, authored by xiang song (charlie.song), committed by GitHub

[Example] GCMC with sampling (#1296)



* gcmc example

* Update Readme

* Add multiprocess support

* Fix

* Multigpu + dataloader

* Delete some dead code

* Delete more

* upd

* Add README

* upd

* Upd

* combine full batch and sample GCMCLayer, use HeteroCov

* Fix

* Update Readme

* udp

* Fix typo

* Add cpu run

* some fix and docstring
Co-authored-by: Ubuntu <ubuntu@ip-172-31-51-214.ec2.internal>
Co-authored-by: Minjie Wang <wmjlyjemaine@gmail.com>
Co-authored-by: Ubuntu <ubuntu@ip-172-31-63-71.ec2.internal>
parent bd1e48a5
README.md
@@ -11,41 +11,217 @@ Credit: Jiani Zhang ([@jennyzhang0215](https://github.com/jennyzhang0215))
## Dependencies
* PyTorch 1.2+
* pandas
* torchtext 0.4+ (if using user and item contents as node features)

## Data
Supported datasets: ml-100k, ml-1m, ml-10m

## How to run
### Train with full-graph
ml-100k, no feature
```bash
python3 train.py --data_name=ml-100k --use_one_hot_fea --gcn_agg_accum=stack
```
Results: RMSE=0.9088 (0.910 reported)
Speed: 0.0410s/epoch (vanilla implementation: 0.1008s/epoch)

ml-100k, with feature
```bash
python3 train.py --data_name=ml-100k --gcn_agg_accum=stack
```
Results: RMSE=0.9448 (0.905 reported)

ml-1m, no feature
```bash
python3 train.py --data_name=ml-1m --gcn_agg_accum=sum --use_one_hot_fea
```
Results: RMSE=0.8377 (0.832 reported)
Speed: 0.0844s/epoch (vanilla implementation: 1.538s/epoch)

ml-10m, no feature
```bash
python3 train.py --data_name=ml-10m --gcn_agg_accum=stack --gcn_dropout=0.3 \
                 --train_lr=0.001 --train_min_lr=0.0001 --train_max_iter=15000 \
                 --use_one_hot_fea --gen_r_num_basis_func=4
```
Results: RMSE=0.7800 (0.777 reported)
Speed: 1.1982s/epoch (vanilla implementation: OOM)

Testbed: EC2 p3.2xlarge instance (Amazon Linux 2)
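The `--gcn_agg_accum` flag chooses how per-rating messages are combined in the encoder (see `GCMCLayer` in model.py below: with `stack`, `msg_units` is divided by the number of rating values, so the per-rating slices are concatenated; with `sum` they are pooled). A toy illustration of the two behaviors, with made-up shapes:
```python
import torch as th

msgs = [th.randn(4, 8) for _ in range(5)]   # one message block per rating value
stacked = th.cat(msgs, dim=1)               # 'stack': (4, 40), one slice per rating
summed = th.stack(msgs, dim=0).sum(0)       # 'sum':   (4, 8), ratings pooled
```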
### Train with minibatch on a single GPU
ml-100k, no feature
```bash
python3 train_sampling.py --data_name=ml-100k \
--use_one_hot_fea \
--gcn_agg_accum=stack \
--gpu 0
```
ml-100k, no feature, with mix_cpu_gpu. In a mix_cpu_gpu run with no features, W_r is stored on the CPU by default rather than on the GPU.
```bash
python3 train_sampling.py --data_name=ml-100k \
--use_one_hot_fea \
--gcn_agg_accum=stack \
--mix_cpu_gpu \
--gpu 0
```
Results: RMSE=0.9380
Speed: 1.059s/epoch (run with 70 epochs)
Speed: 1.046s/epoch (mix_cpu_gpu)
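In mix_cpu_gpu mode the large per-rating weight matrices stay on the CPU and only the rows a minibatch actually touches are copied to the GPU; this matches the row-gather in `dot_or_identity` in model.py below. A minimal sketch of that lookup pattern, with made-up sizes:
```python
import torch as th

W_r = th.randn(100000, 64)       # big per-rating weight matrix, kept on the CPU
nids = th.tensor([3, 17, 42])    # node IDs appearing in the current minibatch
h = W_r[nids].to('cuda:0')       # gather on the CPU, copy only 3 rows to the GPU
```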
ml-100k, with feature
```bash
python3 train_sampling.py --data_name=ml-100k \
--gcn_agg_accum=stack \
--train_max_epoch 90 \
--gpu 0
```
Results: RMSE=0.9574
ml-1m, no feature
```bash
python3 train_sampling.py --data_name=ml-1m \
--gcn_agg_accum=sum \
--use_one_hot_fea \
--train_max_epoch 160 \
--gpu 0
```
ml-1m, no feature, with mix_cpu_gpu
```bash
python3 train_sampling.py --data_name=ml-1m \
--gcn_agg_accum=sum \
--use_one_hot_fea \
--train_max_epoch 60 \
--mix_cpu_gpu \
--gpu 0
```
Results: RMSE=0.8632
Speed: 7.852s/epoch (run with 60 epochs)
Speed: 7.788s/epoch (mix_cpu_gpu)
ml-10m, no feature
```bash
python3 train_sampling.py --data_name=ml-10m \
--gcn_agg_accum=stack \
--gcn_dropout=0.3 \
--train_lr=0.001 \
--train_min_lr=0.0001 \
--train_max_epoch=60 \
--use_one_hot_fea \
--gen_r_num_basis_func=4 \
--gpu 0
```
ml-10m, no feature, with mix_cpu_gpu
```bash
python3 train_sampling.py --data_name=ml-10m \
--gcn_agg_accum=stack \
--gcn_dropout=0.3 \
--train_lr=0.001 \
--train_min_lr=0.0001 \
--train_max_epoch=60 \
--use_one_hot_fea \
--gen_r_num_basis_func=4 \
--mix_cpu_gpu \
--gpu 0
```
Results: RMSE=0.8050
Speed: 394.304s/epoch (run with 60 epochs)
Speed: 408.749s/epoch (mix_cpu_gpu)
Testbed: EC2 p3.2xlarge instance
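Schematically, the minibatch runs above iterate over batches of rating edges rather than training on the full graph at once. A simplified sketch of one epoch (not the actual train_sampling.py code, which also builds sampled encoder/decoder subgraphs for each batch; sizes are hypothetical):
```python
import torch as th

num_pairs = 80000      # number of training (user, movie) rating pairs
batch_size = 1024      # hypothetical batch size

perm = th.randperm(num_pairs)
for start in range(0, num_pairs, batch_size):
    batch = perm[start:start + batch_size]
    # 1. build the encoder/decoder subgraphs around the pairs in `batch`
    # 2. run GCMCLayer on the subgraph, then the decoder on the batch edges
    # 3. cross-entropy loss against the true ratings, optimizer step
```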
### Train with minibatch on multi-GPU
ml-100k, no feature
```bash
python train_sampling.py --data_name=ml-100k \
--gcn_agg_accum=stack \
--train_max_epoch 30 \
--train_lr 0.02 \
--use_one_hot_fea \
--gpu 0,1,2,3,4,5,6,7
```
ml-100k, no feature, with mix_cpu_gpu
```bash
python train_sampling.py --data_name=ml-100k \
--gcn_agg_accum=stack \
--train_max_epoch 30 \
--train_lr 0.02 \
--use_one_hot_fea \
--mix_cpu_gpu \
--gpu 0,1,2,3,4,5,6,7
```
Results: RMSE=0.9397
Speed: 1.202s/epoch (run with only 30 epochs)
Speed: 1.245s/epoch (mix_cpu_gpu)
ml-100k, with feature
```bash
python train_sampling.py --data_name=ml-100k \
--gcn_agg_accum=stack \
--train_max_epoch 30 \
--gpu 0,1,2,3,4,5,6,7
```
Results: RMSE=0.9655
Speed: 1.265s/epoch (run with 30 epochs)
ml-1m, no feature
```bash
python train_sampling.py --data_name=ml-1m \
--gcn_agg_accum=sum \
--train_max_epoch 40 \
--use_one_hot_fea \
--gpu 0,1,2,3,4,5,6,7
```
ml-1m, no feature, with mix_cpu_gpu
```bash
python train_sampling.py --data_name=ml-1m \
--gcn_agg_accum=sum \
--train_max_epoch 40 \
--use_one_hot_fea \
--mix_cpu_gpu \
--gpu 0,1,2,3,4,5,6,7
```
Results: RMSE=0.8621
Speed: 11.612s/epoch (run with 40 epochs)
Speed: 12.483s/epoch (mix_cpu_gpu)
ml-10m, no feature
```bash
python train_sampling.py --data_name=ml-10m \
--gcn_agg_accum=stack \
--gcn_dropout=0.3 \
--train_lr=0.001 \
--train_min_lr=0.0001 \
--train_max_epoch=30 \
--use_one_hot_fea \
--gen_r_num_basis_func=4 \
--gpu 0,1,2,3,4,5,6,7
```
ml-10m, no feature, with mix_cpu_gpu
```bash
python train_sampling.py --data_name=ml-10m \
--gcn_agg_accum=stack \
--gcn_dropout=0.3 \
--train_lr=0.001 \
--train_min_lr=0.0001 \
--train_max_epoch=30 \
--use_one_hot_fea \
--gen_r_num_basis_func=4 \
--mix_cpu_gpu \
--gpu 0,1,2,3,4,5,6,7
```
Results: RMSE=0.8084
Speed: 632.868s/epoch (run with 30 epochs)
Speed: 633.397s/epoch (mix_cpu_gpu)
Testbed: EC2 p3.16xlarge instance
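Per the commit history above ("Add multiprocess support"), the multi-GPU runs presumably launch one training process per device in the `--gpu` list. A hypothetical skeleton of that pattern with `torch.multiprocessing`, not the actual train_sampling.py code:
```python
import torch.multiprocessing as mp

def run_worker(proc_id, devices):
    # Hypothetical worker: each process trains on its own GPU and its own
    # shard of the rating edges; gradient synchronization is omitted.
    device = devices[proc_id]
    print('worker %d training on cuda:%d' % (proc_id, device))

if __name__ == '__main__':
    devices = [0, 1, 2, 3, 4, 5, 6, 7]   # parsed from --gpu 0,1,2,3,4,5,6,7
    mp.spawn(run_worker, args=(devices,), nprocs=len(devices))
```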
### Train with minibatch on CPU
ml-100k, no feature
```bash
python3 train_sampling.py --data_name=ml-100k \
--use_one_hot_fea \
--gcn_agg_accum=stack \
--gpu -1
```
Speed: 1.591s/epoch
Testbed: EC2 r5.xlarge instance
data.py
@@ -5,8 +5,6 @@ import re
import pandas as pd
import scipy.sparse as sp
import torch as th
import dgl
from dgl.data.utils import download, extract_archive, get_download_dir
@@ -84,6 +82,8 @@ class MovieLens(object):
        Dataset name. Could be "ml-100k", "ml-1m", "ml-10m"
    device : torch.device
        Device context
    mix_cpu_gpu : bool, optional
        If true, the ``user_feature`` attribute is stored on the CPU
    use_one_hot_fea : bool, optional
        If true, the ``user_feature`` attribute is None, representing a one-hot identity
        matrix. (Default: False)
@@ -96,7 +96,8 @@ class MovieLens(object):
        Ratio of validation data
    """
    def __init__(self, name, device, mix_cpu_gpu=False,
                 use_one_hot_fea=False, symm=True,
                 test_ratio=0.1, valid_ratio=0.1):
        self._name = name
        self._device = device
@@ -164,8 +165,13 @@ class MovieLens(object):
            self.user_feature = None
            self.movie_feature = None
        else:
            # if mix_cpu_gpu, keep the features on the CPU
            if mix_cpu_gpu:
                self.user_feature = th.FloatTensor(self._process_user_fea())
                self.movie_feature = th.FloatTensor(self._process_movie_fea())
            else:
                self.user_feature = th.FloatTensor(self._process_user_fea()).to(self._device)
                self.movie_feature = th.FloatTensor(self._process_movie_fea()).to(self._device)

        if self.user_feature is None:
            self.user_feature_shape = (self.num_user, self.num_user)
            self.movie_feature_shape = (self.num_movie, self.num_movie)
@@ -204,6 +210,7 @@ class MovieLens(object):
        def _npairs(graph):
            rst = 0
            for r in self.possible_rating_values:
                r = str(r).replace('.', '_')
                rst += graph.number_of_edges(str(r))
            return rst
@@ -245,9 +252,10 @@ class MovieLens(object):
            ridx = np.where(rating_values == rating)
            rrow = rating_row[ridx]
            rcol = rating_col[ridx]
            rating = str(rating).replace('.', '_')
            bg = dgl.bipartite((rrow, rcol), 'user', rating, 'movie',
                               num_nodes=(self._num_user, self._num_movie))
            rev_bg = dgl.bipartite((rcol, rrow), 'movie', 'rev-%s' % rating, 'user',
                                   num_nodes=(self._num_movie, self._num_user))
            rating_graphs.append(bg)
            rating_graphs.append(rev_bg)
@@ -267,7 +275,7 @@ class MovieLens(object):
        movie_ci = []
        movie_cj = []
        for r in self.possible_rating_values:
            r = str(r).replace('.', '_')
            user_ci.append(graph['rev-%s' % r].in_degrees())
            movie_ci.append(graph[r].in_degrees())
            if self._symm:
@@ -494,6 +502,8 @@ class MovieLens(object):
        Generate movie features by concatenating embedding and the year
        """
        import torchtext

        if self._name == 'ml-100k':
            GENRES = GENRES_ML_100K
        elif self._name == 'ml-1m':
@@ -503,8 +513,8 @@ class MovieLens(object):
        else:
            raise NotImplementedError

        TEXT = torchtext.data.Field(tokenize='spacy')
        embedding = torchtext.vocab.GloVe(name='840B', dim=300)
        title_embedding = np.zeros(shape=(self.movie_info.shape[0], 300), dtype=np.float32)
        release_years = np.zeros(shape=(self.movie_info.shape[0], 1), dtype=np.float32)
"""NN modules""" """NN modules"""
import torch as th import torch as th
import torch.nn as nn import torch.nn as nn
from torch.nn import init
import dgl.function as fn import dgl.function as fn
import dgl.nn.pytorch as dglnn
from utils import get_activation from utils import get_activation
class GCMCGraphConv(nn.Module):
    """Graph convolution module used in the GCMC model.

    Parameters
    ----------
    in_feats : int
        Input feature size.
    out_feats : int
        Output feature size.
    weight : bool, optional
        If True, apply a linear layer. Otherwise, aggregate the messages
        without a weight matrix or with a shared weight provided by the caller.
    device : str, optional
        Which device to put data on. Useful for mix_cpu_gpu training and
        multi-GPU training.
    """
    def __init__(self,
                 in_feats,
                 out_feats,
                 weight=True,
                 device=None,
                 dropout_rate=0.0):
        super(GCMCGraphConv, self).__init__()
        self._in_feats = in_feats
        self._out_feats = out_feats
        self.device = device
        self.dropout = nn.Dropout(dropout_rate)

        if weight:
            self.weight = nn.Parameter(th.Tensor(in_feats, out_feats))
        else:
            self.register_parameter('weight', None)
        self.reset_parameters()

    def reset_parameters(self):
        """Reinitialize learnable parameters."""
        if self.weight is not None:
            init.xavier_uniform_(self.weight)

    def forward(self, graph, feat, weight=None):
        """Compute graph convolution.

        Normalizer constant :math:`c_{ij}` is stored as two node data, "ci"
        and "cj".

        Parameters
        ----------
        graph : DGLGraph
            The graph.
        feat : torch.Tensor
            The input feature.
        weight : torch.Tensor, optional
            Optional external weight tensor.

        Returns
        -------
        torch.Tensor
            The output feature.
        """
        with graph.local_scope():
            cj = graph.srcdata['cj']
            ci = graph.dstdata['ci']
            if self.device is not None:
                cj = cj.to(self.device)
                ci = ci.to(self.device)
            if weight is not None:
                if self.weight is not None:
                    raise DGLError('External weight is provided while at the same time the'
                                   ' module has defined its own weight parameter. Please'
                                   ' create the module with flag weight=False.')
            else:
                weight = self.weight

            if weight is not None:
                feat = dot_or_identity(feat, weight, self.device)

            # left norm and dropout
            feat = feat * self.dropout(cj)
            graph.srcdata['h'] = feat
            graph.update_all(fn.copy_src(src='h', out='m'),
                             fn.sum(msg='m', out='h'))
            rst = graph.dstdata['h']
            # right norm
            rst = rst * ci

        return rst
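For reference, the convolution above computes, per destination node i (a sketch: `ci` and `cj` are the degree-based normalizers prepared in data.py; with `symm=True` they are presumably 1/sqrt(degree), following the GCMC paper):
```latex
h_i = c_i \sum_{j \in \mathcal{N}(i)} c_j \, W_r x_j
```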
class GCMCLayer(nn.Module):
    r"""GCMC layer
@@ -49,6 +138,9 @@ class GCMCLayer(nn.Module):
        If true, user node and movie node share the same set of parameters.
        Require ``user_in_units`` and ``movie_in_units`` to be the same.
        (Default: False)
    device : str, optional
        Which device to put data on. Useful for mix_cpu_gpu training and
        multi-GPU training.
    """
    def __init__(self,
                 rating_vals,
@@ -60,7 +152,8 @@ class GCMCLayer(nn.Module):
                 agg='stack',  # or 'sum'
                 agg_act=None,
                 out_act=None,
                 share_user_item_param=False,
                 device=None):
        super(GCMCLayer, self).__init__()
        self.rating_vals = rating_vals
        self.agg = agg
@@ -77,19 +170,57 @@ class GCMCLayer(nn.Module):
            msg_units = msg_units // len(rating_vals)
        self.dropout = nn.Dropout(dropout_rate)
        self.W_r = nn.ParameterDict()
        subConv = {}
        for rating in rating_vals:
            # PyTorch parameter name can't contain "."
            rating = str(rating).replace('.', '_')
            rev_rating = 'rev-%s' % rating
            if share_user_item_param and user_in_units == movie_in_units:
                self.W_r[rating] = nn.Parameter(th.randn(user_in_units, msg_units))
                self.W_r['rev-%s' % rating] = self.W_r[rating]
                subConv[rating] = GCMCGraphConv(user_in_units,
                                                msg_units,
                                                weight=False,
                                                device=device,
                                                dropout_rate=dropout_rate)
                subConv[rev_rating] = GCMCGraphConv(user_in_units,
                                                    msg_units,
                                                    weight=False,
                                                    device=device,
                                                    dropout_rate=dropout_rate)
            else:
                self.W_r = None
                subConv[rating] = GCMCGraphConv(user_in_units,
                                                msg_units,
                                                weight=True,
                                                device=device,
                                                dropout_rate=dropout_rate)
                subConv[rev_rating] = GCMCGraphConv(movie_in_units,
                                                    msg_units,
                                                    weight=True,
                                                    device=device,
                                                    dropout_rate=dropout_rate)
        # one GCMCGraphConv per rating relation; HeteroGraphConv combines the
        # per-relation outputs with `agg` ('stack' or 'sum')
        self.conv = dglnn.HeteroGraphConv(subConv, aggregate=agg)
        self.agg_act = get_activation(agg_act)
        self.out_act = get_activation(out_act)
        self.device = device
        self.reset_parameters()
    def partial_to(self, device):
        """Put the parameters onto the given device, except W_r.

        Parameters
        ----------
        device : torch device
            Which device the parameters are put on.
        """
        assert device == self.device
        if device is not None:
            self.ufc.cuda(device)
            if self.share_user_item_param is False:
                self.ifc.cuda(device)
            self.dropout.cuda(device)
    def reset_parameters(self):
        for p in self.parameters():
            if p.dim() > 1:
@@ -98,9 +229,6 @@ class GCMCLayer(nn.Module):
    def forward(self, graph, ufeat=None, ifeat=None):
        """Forward function

        Parameters
        ----------
        graph : DGLHeteroGraph
@@ -118,28 +246,19 @@ class GCMCLayer(nn.Module):
        new_ifeat : torch.Tensor
            New movie features
        """
        in_feats = {'user' : ufeat, 'movie' : ifeat}
        mod_args = {}
        for i, rating in enumerate(self.rating_vals):
            rating = str(rating).replace('.', '_')
            rev_rating = 'rev-%s' % rating
            mod_args[rating] = (self.W_r[rating] if self.W_r is not None else None,)
            mod_args[rev_rating] = (self.W_r[rev_rating] if self.W_r is not None else None,)
        out_feats = self.conv(graph, in_feats, mod_args=mod_args)
        ufeat = out_feats['user']
        ifeat = out_feats['movie']
        ufeat = ufeat.view(ufeat.shape[0], -1)
        ifeat = ifeat.view(ifeat.shape[0], -1)
        # fc and non-linear
        ufeat = self.agg_act(ufeat)
        ifeat = self.agg_act(ifeat)
@@ -150,7 +269,10 @@ class GCMCLayer(nn.Module):
        return self.out_act(ufeat), self.out_act(ifeat)
class BiDecoder(nn.Module):
    r"""Bi-linear decoder.

    Given a bipartite graph G, for each edge (i, j) ~ G, compute the likelihood
    of it being class r by:

    .. math::
        p(M_{ij}=r) = \text{softmax}(u_i^T Q_r v_j)

@@ -163,28 +285,27 @@ class BiDecoder(nn.Module):
    Parameters
    ----------
    in_units : int
        Size of input user and movie features
    num_classes : int
        Number of classes.
    num_basis : int, optional
        Number of basis. (Default: 2)
    dropout_rate : float, optional
        Dropout rate (Default: 0.0)
    """
    def __init__(self,
                 in_units,
                 num_classes,
                 num_basis=2,
                 dropout_rate=0.0):
        super(BiDecoder, self).__init__()
        self._num_basis = num_basis
        self.dropout = nn.Dropout(dropout_rate)
        self.Ps = nn.ParameterList()
        for i in range(num_basis):
            self.Ps.append(nn.Parameter(th.randn(in_units, in_units)))
        # combine_basis learns, per class r, coefficients a_{rs} such that
        # Q_r = sum_s a_{rs} P_s
        self.combine_basis = nn.Linear(self._num_basis, num_classes, bias=False)
        self.reset_parameters()
    def reset_parameters(self):
@@ -209,22 +330,83 @@ class BiDecoder(nn.Module):
        th.Tensor
            Predicting scores for each user-movie edge.
        """
        with graph.local_scope():
            ufeat = self.dropout(ufeat)
            ifeat = self.dropout(ifeat)
            graph.nodes['movie'].data['h'] = ifeat
            basis_out = []
            for i in range(self._num_basis):
                graph.nodes['user'].data['h'] = ufeat @ self.Ps[i]
                graph.apply_edges(fn.u_dot_v('h', 'h', 'sr'))
                basis_out.append(graph.edata['sr'].unsqueeze(1))
            out = th.cat(basis_out, dim=1)
            out = self.combine_basis(out)
        return out
class DenseBiDecoder(BiDecoder):
    r"""Dense bi-linear decoder.

    Dense implementation of the bi-linear decoder used in GCMC. Suitable when
    the graph can be efficiently represented by a pair of arrays (one for source
    nodes; one for destination nodes).

    Parameters
    ----------
    in_units : int
        Size of input user and movie features
    num_classes : int
        Number of classes.
    num_basis : int, optional
        Number of basis. (Default: 2)
    dropout_rate : float, optional
        Dropout rate (Default: 0.0)
    """
    def __init__(self,
                 in_units,
                 num_classes,
                 num_basis=2,
                 dropout_rate=0.0):
        super(DenseBiDecoder, self).__init__(in_units,
                                             num_classes,
                                             num_basis,
                                             dropout_rate)

    def forward(self, ufeat, ifeat):
        """Forward function.

        Compute logits for each pair ``(ufeat[i], ifeat[i])``.

        Parameters
        ----------
        ufeat : th.Tensor
            User embeddings. Shape: (B, D)
        ifeat : th.Tensor
            Movie embeddings. Shape: (B, D)

        Returns
        -------
        th.Tensor
            Predicting scores for each user-movie edge. Shape: (B, num_classes)
        """
        ufeat = self.dropout(ufeat)
        ifeat = self.dropout(ifeat)
        basis_out = []
        for i in range(self._num_basis):
            ufeat_i = ufeat @ self.Ps[i]
            # row-wise dot product: out[b] = <ufeat_i[b], ifeat[b]>
            out = th.einsum('ab,ab->a', ufeat_i, ifeat)
            basis_out.append(out.unsqueeze(1))

        out = th.cat(basis_out, dim=1)
        out = self.combine_basis(out)
        return out
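A quick smoke test of the dense decoder's (B, D) to (B, num_classes) contract, with hypothetical sizes and random features (run in model.py's namespace, where `th` is torch):
```python
dec = DenseBiDecoder(in_units=8, num_classes=5, num_basis=2)
u = th.randn(5, 8)      # 5 user embeddings
v = th.randn(5, 8)      # 5 movie embeddings
logits = dec(u, v)      # shape: (5, 5), one score per rating class
```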
def dot_or_identity(A, B, device=None):
    # if A is None, treat as identity matrix
    if A is None:
        return B
    elif len(A.shape) == 1:
        # A is a 1-D tensor of node IDs (one-hot input): gather the needed
        # rows of B and, if requested, move only those rows to `device`
        if device is None:
            return B[A]
        else:
            return B[A].to(device)
    else:
        return A @ B
"""Training script""" """Training GCMC model on the MovieLens data set.
The script loads the full graph to the training device.
"""
import os, time import os, time
import argparse import argparse
import logging import logging
...@@ -8,7 +11,7 @@ import numpy as np ...@@ -8,7 +11,7 @@ import numpy as np
import torch as th import torch as th
import torch.nn as nn import torch.nn as nn
from data import MovieLens from data import MovieLens
from model import GCMCLayer, BiDecoder from model import BiDecoder, GCMCLayer
from utils import get_activation, get_optimizer, torch_total_param_num, torch_net_info, MetricLogger from utils import get_activation, get_optimizer, torch_total_param_num, torch_net_info, MetricLogger
class Net(nn.Module):
@@ -23,10 +26,11 @@ class Net(nn.Module):
            args.gcn_dropout,
            args.gcn_agg_accum,
            agg_act=self._act,
            share_user_item_param=args.share_param,
            device=args.device)
        self.decoder = BiDecoder(in_units=args.gcn_out_units,
                                 num_classes=len(args.rating_vals),
                                 num_basis=args.gen_r_num_basis_func)

    def forward(self, enc_graph, dec_graph, ufeat, ifeat):
        user_out, movie_out = self.encoder(
This diff is collapsed.