Unverified commit cf9ba90f authored by Mufei Li, committed by GitHub

[Chem] ACNN and various utilities (#1117)

* Add several splitting methods

parent 6beef85b
......@@ -142,8 +142,22 @@ Molecular Graphs
To work on molecular graphs, make sure you have installed `RDKit 2018.09.3 <https://www.rdkit.org/docs/Install.html>`__.
Data Loading and Processing Utils
`````````````````````````````````
We adapt several utilities for processing molecules from
`DeepChem <https://github.com/deepchem/deepchem/blob/master/deepchem>`__.
.. autosummary::
    :toctree: ../../generated/

    chem.add_hydrogens_to_mol
    chem.get_mol_3D_coordinates
    chem.load_molecule
    chem.multiprocess_load_molecules
Featurization Utils for Single Molecule
```````````````````````````````````````
To apply graph neural networks, we need to featurize the nodes (atoms) and edges (bonds) of molecular graphs.
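As a tiny illustration of what atom featurization often boils down to, here is a one-hot encoding sketch. The helper name mirrors `dgl.data.chem.one_hot_encoding`, but treat the signature and behavior as assumptions for illustration, not the actual API:

```python
def one_hot_encoding(x, allowable_set):
    """Map x to a binary list over allowable_set; all zeros if x is unseen."""
    return [int(x == s) for s in allowable_set]

# Featurize a carbon atom against a small atom-type vocabulary
feat = one_hot_encoding('C', ['C', 'N', 'O'])
```

Node featurizers typically concatenate several such encodings (atom type, hybridization, aromaticity) into one feature vector per atom.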
......@@ -202,8 +216,8 @@ Utils for bond featurization:
chem.BaseBondFeaturizer.__call__
chem.CanonicalBondFeaturizer
Graph Construction for Single Molecule
``````````````````````````````````````
Several methods for constructing DGLGraphs from SMILES/RDKit molecule objects are listed below:
......@@ -215,6 +229,17 @@ Several methods for constructing DGLGraphs from SMILES/RDKit molecule objects ar
chem.mol_to_bigraph
chem.smiles_to_complete_graph
chem.mol_to_complete_graph
chem.k_nearest_neighbors
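As a hedged sketch of what nearest-neighbor graph construction computes, in the spirit of `chem.k_nearest_neighbors` (the function name comes from the list above, but the signature and return values here are illustrative, not the actual DGL API):

```python
# Connect each atom to at most max_num_neighbors other atoms within a
# distance cutoff, given 3D coordinates (illustrative sketch).
def k_nearest_neighbors(coords, neighbor_cutoff, max_num_neighbors):
    srcs, dsts = [], []
    for i in range(len(coords)):
        # Squared distances from atom i to every other atom, nearest first
        dists = sorted(
            (sum((coords[i][k] - coords[j][k]) ** 2 for k in range(3)), j)
            for j in range(len(coords)) if j != i)
        for d2, j in dists[:max_num_neighbors]:
            if d2 <= neighbor_cutoff ** 2:
                srcs.append(j)
                dsts.append(i)
    return srcs, dsts

# Four atoms on a line, 1.5 distance cutoff, at most 2 neighbors per atom
coords = [(0., 0., 0.), (1., 0., 0.), (2., 0., 0.), (3., 0., 0.)]
src, dst = k_nearest_neighbors(coords, neighbor_cutoff=1.5, max_num_neighbors=2)
```

Here only adjacent atoms fall within the cutoff, so each atom gets one or two incoming edges.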
Graph Construction and Featurization for Ligand-Protein Complex
```````````````````````````````````````````````````````````````
Constructing DGLHeteroGraphs and featurizing them.
.. autosummary::
    :toctree: ../../generated/

    chem.ACNN_graph_construction_and_featurization
Dataset Classes
```````````````
......@@ -224,11 +249,12 @@ If your dataset is stored in a ``.csv`` file, you may find it helpful to use
.. autoclass:: dgl.data.chem.CSVDataset
:members: __getitem__, __len__
Currently four datasets are supported:
* Tox21
* TencentAlchemyDataset
* PubChemBioAssayAromaticity
* PDBBind
.. autoclass:: dgl.data.chem.Tox21
:members: __getitem__, __len__, task_pos_weights
......@@ -238,3 +264,32 @@ Currently three datasets are supported:
.. autoclass:: dgl.data.chem.PubChemBioAssayAromaticity
:members: __getitem__, __len__
.. autoclass:: dgl.data.chem.PDBBind
:members: __getitem__, __len__
Dataset Splitting
`````````````````
We provide support for some common data splitting methods:
* consecutive split
* random split
* molecular weight split
* Bemis-Murcko scaffold split
* single-task-stratified split
.. autoclass:: dgl.data.chem.ConsecutiveSplitter
:members: train_val_test_split, k_fold_split
.. autoclass:: dgl.data.chem.RandomSplitter
:members: train_val_test_split, k_fold_split
.. autoclass:: dgl.data.chem.MolecularWeightSplitter
:members: train_val_test_split, k_fold_split
.. autoclass:: dgl.data.chem.ScaffoldSplitter
:members: train_val_test_split, k_fold_split
.. autoclass:: dgl.data.chem.SingleTaskStratifiedSplitter
:members: train_val_test_split, k_fold_split
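The splitters above all expose the same `train_val_test_split` entry point. As a rough illustration of the simplest case, a consecutive split just cuts the index range by the requested fractions (a sketch only, not the actual `ConsecutiveSplitter` implementation):

```python
# Minimal sketch of a consecutive train/val/test split over dataset indices.
def consecutive_split(num_datapoints, frac_train=0.8, frac_val=0.1, frac_test=0.1):
    assert abs(frac_train + frac_val + frac_test - 1.0) < 1e-6
    indices = list(range(num_datapoints))
    num_train = int(num_datapoints * frac_train)
    num_val = int(num_datapoints * frac_val)
    return (indices[:num_train],
            indices[num_train:num_train + num_val],
            indices[num_train + num_val:])

train_ids, val_ids, test_ids = consecutive_split(10)
```

The other splitters differ only in how they order or group the indices first (randomly, by molecular weight, by scaffold, or by label strata).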
......@@ -59,3 +59,13 @@ Currently supported model architectures:
.. autoclass:: dgl.model_zoo.chem.DGLJTNNVAE
:members: forward
Protein-Ligand Binding
``````````````````````
Currently supported model architectures:
* ACNN
.. autoclass:: dgl.model_zoo.chem.ACNN
:members: forward
......@@ -115,6 +115,13 @@ NNConv
:members: forward
:show-inheritance:
AtomicConv
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
.. autoclass:: dgl.nn.pytorch.conv.AtomicConv
:members: forward
:show-inheritance:
Dense Conv Layers
----------------------------------------
......@@ -222,5 +229,3 @@ Edge Softmax
.. automodule:: dgl.nn.pytorch.softmax
:members: edge_softmax
......@@ -4,20 +4,28 @@ With atoms being nodes and bonds being edges, molecular graphs are among the cor
Deep learning on graphs can be beneficial for various applications in Chemistry like drug and material discovery
[1], [2], [12].
To make it easy for domain scientists, the DGL team releases a model zoo for Chemistry, spanning three cases
-- property prediction, target generation/optimization and binding affinity prediction.
With pre-trained models and training scripts, we hope this model zoo will be helpful for both
the chemistry community and the deep learning community to further their research.
## Dependencies
Before you proceed, depending on the model/task you are interested in,
you may need to install some of the dependencies below:
- PyTorch 1.2
- Check the [official website](https://pytorch.org/) for installation guide.
- RDKit 2018.09.3
- We recommend installation with `conda install -c conda-forge rdkit==2018.09.3`. For other installation recipes,
see the [official documentation](https://www.rdkit.org/docs/Install.html).
- Pdbfixer
- We recommend installation with `conda install -c omnia pdbfixer`. To install from source, see the
[manual](http://htmlpreview.github.io/?https://raw.github.com/pandegroup/pdbfixer/master/Manual.html).
- MDTraj
- We recommend installation with `conda install -c conda-forge mdtraj`. For alternative ways of installation,
see the [official documentation](http://mdtraj.org/1.9.3/installation.html).
The remaining dependencies can be installed with `pip install -r requirements.txt`.
......@@ -31,13 +39,7 @@ Below we provide some reference numbers to show how DGL improves the speed of tr
| AttentiveFP on Aromaticity | 6.0 | 1.2 | 5x |
| JTNN on ZINC | 1826 | 743 | 2.5x |
## Featurization and Representation Learning
Fingerprints have been a widely used concept in cheminformatics. Chemists developed hand-designed rules to convert
molecules into binary strings where each bit indicates the presence or absence of a particular substructure. The
......@@ -47,6 +49,12 @@ mostly developed based on molecule fingerprints.
Graph neural networks make it possible to learn data-driven representations of molecules from the atoms, bonds and
molecular graph topology, which may be viewed as a learned fingerprint [3].
## Property Prediction
To evaluate molecules for drug candidates, we need to know their properties and activities. In practice, this is
mostly achieved via wet lab experiments. We can cast the problem as a regression or classification problem.
In practice, this can be quite difficult due to the scarcity of labeled data.
### Models
- **Graph Convolutional Networks** [3], [9]: Graph Convolutional Networks (GCN) have been one of the most popular graph
neural networks and they can be easily extended for graph level prediction.
......@@ -123,6 +131,17 @@ SVG(Draw.MolsToGridImage(mols, molsPerRow=4, subImgSize=(180, 150), useSVG=True)
![](https://s3.us-east-2.amazonaws.com/dgl.ai/model_zoo/drug_discovery/dgmg_model_zoo_example2.png)
## Binding Affinity Prediction

The interaction of drugs and proteins can be characterized in terms of binding affinity. Given a ligand
(drug candidate) and a protein with particular conformations, we are interested in predicting the
binding affinity between them.
### Models
- **Atomic Convolutional Networks** [14]: Constructs nearest neighbor graphs separately for the ligand, protein and complex
based on the 3D coordinates of the atoms and predicts the binding free energy.
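Predicting binding free energy is a regression problem, and one recurring detail when training such models is to z-score normalize the training labels and undo the normalization at evaluation time. A minimal sketch of that round trip, with all names illustrative:

```python
# Illustrative z-score normalization of regression labels.
def normalize(labels, mean, std):
    return [(y - mean) / std for y in labels]

def denormalize(preds, mean, std):
    return [p * std + mean for p in preds]

labels = [2.0, 4.0, 6.0]
mean = sum(labels) / len(labels)
# Unbiased (n - 1) standard deviation, matching torch.Tensor.std's default
std = (sum((y - mean) ** 2 for y in labels) / (len(labels) - 1)) ** 0.5
z = normalize(labels, mean, std)
recovered = denormalize(z, mean, std)
```

Normalizing the targets keeps the regression loss well-scaled during training; metrics are then computed on the denormalized predictions.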
## References
[1] Chen et al. (2018) The rise of deep learning in drug discovery. *Drug Discov Today* 6, 1241-1250.
......@@ -159,3 +178,5 @@ Machine Learning* JMLR. 1263-1272.
[13] Jin et al. (2018) Junction Tree Variational Autoencoder for Molecular Graph Generation.
*Proceedings of the 35th International Conference on Machine Learning (ICML)*, 2323-2332.
[14] Gomes et al. (2017) Atomic Convolutional Networks for Predicting Protein-Ligand Binding Affinity. *arXiv preprint arXiv:1703.10603*.
# Binding Affinity Prediction
## Datasets
- **PDBBind**: The PDBBind dataset in MoleculeNet [1] processed from the PDBBind database. The PDBBind
database consists of experimentally measured binding affinities for bio-molecular complexes [2], [3].
It provides detailed 3D Cartesian coordinates of both ligands and their target proteins derived from
experimental (e.g., X-ray crystallography) measurements. The availability of coordinates of the
protein-ligand complexes permits structure-based featurization that is aware of the protein-ligand
binding geometry. The authors of [1] use the "refined" and "core" subsets of the database [4], more carefully
processed for data artifacts, as additional benchmarking targets.
## Models
- **Atomic Convolutional Networks (ACNN)** [5]: Constructs nearest neighbor graphs separately for the ligand, protein and complex
based on the 3D coordinates of the atoms and predicts the binding free energy.
## Usage
Use `main.py` with arguments
```
-m {ACNN}, Model to use
-d {PDBBind_core_pocket_random, PDBBind_core_pocket_scaffold, PDBBind_core_pocket_stratified,
PDBBind_core_pocket_temporal, PDBBind_refined_pocket_random, PDBBind_refined_pocket_scaffold,
PDBBind_refined_pocket_stratified, PDBBind_refined_pocket_temporal}, dataset and splitting method to use
```
## Performance
### PDBBind
#### ACNN
| Subset | Splitting Method | Test MAE | Test R2 |
| ------- | ---------------- | -------- | ------- |
| Core | Random | 1.7688 | 0.1511 |
| Core | Scaffold | 2.5420 | 0.1471 |
| Core | Stratified | 1.7419 | 0.1520 |
| Core | Temporal | 1.9543 | 0.1640 |
| Refined | Random | 1.1948 | 0.4373 |
| Refined | Scaffold | 1.4021 | 0.2086 |
| Refined | Stratified | 1.6376 | 0.3050 |
| Refined | Temporal | 1.2457 | 0.3438 |
## Speed
### ACNN
Compared to [DeepChem's implementation](https://github.com/joegomes/deepchem/tree/acdc), we achieve a roughly
3.3x speedup in training time per epoch (from 1.40s to 0.42s). If we do not care about the
randomness introduced by some kernel optimizations, the speedup is roughly 4.4x (from 1.40s to 0.32s).
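The quoted speedups follow directly from the per-epoch times:

```python
deepchem_epoch_time = 1.40   # seconds per epoch, DeepChem implementation
dgl_epoch_time = 0.42        # seconds per epoch, DGL, deterministic kernels
dgl_fast_epoch_time = 0.32   # seconds per epoch, DGL, non-deterministic kernels

speedup = deepchem_epoch_time / dgl_epoch_time            # ~3.3x
speedup_fast = deepchem_epoch_time / dgl_fast_epoch_time  # ~4.4x
```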
## References
[1] Wu et al. (2017) MoleculeNet: a benchmark for molecular machine learning. *Chemical Science* 9, 513-530.
[2] Wang et al. (2004) The PDBbind database: collection of binding affinities for protein-ligand complexes
with known three-dimensional structures. *J Med Chem* 3;47(12):2977-80.
[3] Wang et al. (2005) The PDBbind database: methodologies and updates. *J Med Chem* 16;48(12):4111-9.
[4] Liu et al. (2015) PDB-wide collection of binding data: current status of the PDBbind database. *Bioinformatics* 1;31(3):405-12.
[5] Gomes et al. (2017) Atomic Convolutional Networks for Predicting Protein-Ligand Binding Affinity. *arXiv preprint arXiv:1703.10603*.
import numpy as np
import torch
ACNN_PDBBind_core_pocket_random = {
'dataset': 'PDBBind',
'subset': 'core',
'load_binding_pocket': True,
'random_seed': 123,
'frac_train': 0.8,
'frac_val': 0.,
'frac_test': 0.2,
'batch_size': 24,
'shuffle': False,
'hidden_sizes': [32, 32, 16],
'weight_init_stddevs': [1. / float(np.sqrt(32)), 1. / float(np.sqrt(32)),
1. / float(np.sqrt(16)), 0.01],
'dropouts': [0., 0., 0.],
'atomic_numbers_considered': torch.tensor([
1., 6., 7., 8., 9., 11., 12., 15., 16., 17., 20., 25., 30., 35., 53.]),
'radial': [[12.0], [0.0, 4.0, 8.0], [4.0]],
'lr': 0.001,
'num_epochs': 120,
'metrics': ['pearson_r2', 'mae'],
'split': 'random'
}
ACNN_PDBBind_core_pocket_scaffold = {
'dataset': 'PDBBind',
'subset': 'core',
'load_binding_pocket': True,
'random_seed': 123,
'frac_train': 0.8,
'frac_val': 0.,
'frac_test': 0.2,
'batch_size': 24,
'shuffle': False,
'hidden_sizes': [32, 32, 16],
'weight_init_stddevs': [1. / float(np.sqrt(32)), 1. / float(np.sqrt(32)),
1. / float(np.sqrt(16)), 0.01],
'dropouts': [0., 0., 0.],
'atomic_numbers_considered': torch.tensor([
1., 6., 7., 8., 9., 11., 12., 15., 16., 17., 20., 25., 30., 35., 53.]),
'radial': [[12.0], [0.0, 4.0, 8.0], [4.0]],
'lr': 0.001,
'num_epochs': 170,
'metrics': ['pearson_r2', 'mae'],
'split': 'scaffold'
}
ACNN_PDBBind_core_pocket_stratified = {
'dataset': 'PDBBind',
'subset': 'core',
'load_binding_pocket': True,
'random_seed': 123,
'frac_train': 0.8,
'frac_val': 0.,
'frac_test': 0.2,
'batch_size': 24,
'shuffle': False,
'hidden_sizes': [32, 32, 16],
'weight_init_stddevs': [1. / float(np.sqrt(32)), 1. / float(np.sqrt(32)),
1. / float(np.sqrt(16)), 0.01],
'dropouts': [0., 0., 0.],
'atomic_numbers_considered': torch.tensor([
1., 6., 7., 8., 9., 11., 12., 15., 16., 17., 20., 25., 30., 35., 53.]),
'radial': [[12.0], [0.0, 4.0, 8.0], [4.0]],
'lr': 0.001,
'num_epochs': 110,
'metrics': ['pearson_r2', 'mae'],
'split': 'stratified'
}
ACNN_PDBBind_core_pocket_temporal = {
'dataset': 'PDBBind',
'subset': 'core',
'load_binding_pocket': True,
'random_seed': 123,
'frac_train': 0.8,
'frac_val': 0.,
'frac_test': 0.2,
'batch_size': 24,
'shuffle': False,
'hidden_sizes': [32, 32, 16],
'weight_init_stddevs': [1. / float(np.sqrt(32)), 1. / float(np.sqrt(32)),
1. / float(np.sqrt(16)), 0.01],
'dropouts': [0., 0., 0.],
'atomic_numbers_considered': torch.tensor([
1., 6., 7., 8., 9., 11., 12., 15., 16., 17., 20., 25., 30., 35., 53.]),
'radial': [[12.0], [0.0, 4.0, 8.0], [4.0]],
'lr': 0.001,
'num_epochs': 80,
'metrics': ['pearson_r2', 'mae'],
'split': 'temporal'
}
ACNN_PDBBind_refined_pocket_random = {
'dataset': 'PDBBind',
'subset': 'refined',
'load_binding_pocket': True,
'random_seed': 123,
'frac_train': 0.8,
'frac_val': 0.,
'frac_test': 0.2,
'batch_size': 24,
'shuffle': False,
'hidden_sizes': [128, 128, 64],
'weight_init_stddevs': [0.125, 0.125, 0.177, 0.01],
'dropouts': [0.4, 0.4, 0.],
'atomic_numbers_considered': torch.tensor([
1., 6., 7., 8., 9., 11., 12., 15., 16., 17., 19., 20., 25., 26., 27., 28.,
29., 30., 34., 35., 38., 48., 53., 55., 80.]),
'radial': [[12.0], [0.0, 2.0, 4.0, 6.0, 8.0], [4.0]],
'lr': 0.001,
'num_epochs': 200,
'metrics': ['pearson_r2', 'mae'],
'split': 'random'
}
ACNN_PDBBind_refined_pocket_scaffold = {
'dataset': 'PDBBind',
'subset': 'refined',
'load_binding_pocket': True,
'random_seed': 123,
'frac_train': 0.8,
'frac_val': 0.,
'frac_test': 0.2,
'batch_size': 24,
'shuffle': False,
'hidden_sizes': [128, 128, 64],
'weight_init_stddevs': [0.125, 0.125, 0.177, 0.01],
'dropouts': [0.4, 0.4, 0.],
'atomic_numbers_considered': torch.tensor([
1., 6., 7., 8., 9., 11., 12., 15., 16., 17., 19., 20., 25., 26., 27., 28.,
29., 30., 34., 35., 38., 48., 53., 55., 80.]),
'radial': [[12.0], [0.0, 2.0, 4.0, 6.0, 8.0], [4.0]],
'lr': 0.001,
'num_epochs': 350,
'metrics': ['pearson_r2', 'mae'],
'split': 'scaffold'
}
ACNN_PDBBind_refined_pocket_stratified = {
'dataset': 'PDBBind',
'subset': 'refined',
'load_binding_pocket': True,
'random_seed': 123,
'frac_train': 0.8,
'frac_val': 0.,
'frac_test': 0.2,
'batch_size': 24,
'shuffle': False,
'hidden_sizes': [128, 128, 64],
'weight_init_stddevs': [0.125, 0.125, 0.177, 0.01],
'dropouts': [0.4, 0.4, 0.],
'atomic_numbers_considered': torch.tensor([
1., 6., 7., 8., 9., 11., 12., 15., 16., 17., 19., 20., 25., 26., 27., 28.,
29., 30., 34., 35., 38., 48., 53., 55., 80.]),
'radial': [[12.0], [0.0, 2.0, 4.0, 6.0, 8.0], [4.0]],
'lr': 0.001,
'num_epochs': 400,
'metrics': ['pearson_r2', 'mae'],
'split': 'stratified'
}
ACNN_PDBBind_refined_pocket_temporal = {
'dataset': 'PDBBind',
'subset': 'refined',
'load_binding_pocket': True,
'random_seed': 123,
'frac_train': 0.8,
'frac_val': 0.,
'frac_test': 0.2,
'batch_size': 24,
'shuffle': False,
'hidden_sizes': [128, 128, 64],
'weight_init_stddevs': [0.125, 0.125, 0.177, 0.01],
'dropouts': [0.4, 0.4, 0.],
'atomic_numbers_considered': torch.tensor([
1., 6., 7., 8., 9., 11., 12., 15., 16., 17., 19., 20., 25., 26., 27., 28.,
29., 30., 34., 35., 38., 48., 53., 55., 80.]),
'radial': [[12.0], [0.0, 2.0, 4.0, 6.0, 8.0], [4.0]],
'lr': 0.001,
'num_epochs': 350,
'metrics': ['pearson_r2', 'mae'],
'split': 'temporal'
}
experiment_configures = {
'ACNN_PDBBind_core_pocket_random': ACNN_PDBBind_core_pocket_random,
'ACNN_PDBBind_core_pocket_scaffold': ACNN_PDBBind_core_pocket_scaffold,
'ACNN_PDBBind_core_pocket_stratified': ACNN_PDBBind_core_pocket_stratified,
'ACNN_PDBBind_core_pocket_temporal': ACNN_PDBBind_core_pocket_temporal,
'ACNN_PDBBind_refined_pocket_random': ACNN_PDBBind_refined_pocket_random,
'ACNN_PDBBind_refined_pocket_scaffold': ACNN_PDBBind_refined_pocket_scaffold,
'ACNN_PDBBind_refined_pocket_stratified': ACNN_PDBBind_refined_pocket_stratified,
'ACNN_PDBBind_refined_pocket_temporal': ACNN_PDBBind_refined_pocket_temporal
}
def get_exp_configure(exp_name):
return experiment_configures[exp_name]
import torch
import torch.nn as nn
from torch.utils.data import DataLoader
from utils import set_random_seed, load_dataset, collate, load_model, Meter
def update_msg_from_scores(msg, scores):
for metric, score in scores.items():
msg += ', {} {:.4f}'.format(metric, score)
return msg
def run_a_train_epoch(args, epoch, model, data_loader,
loss_criterion, optimizer):
model.train()
train_meter = Meter(args['train_mean'], args['train_std'])
epoch_loss = 0
for batch_id, batch_data in enumerate(data_loader):
indices, ligand_mols, protein_mols, bg, labels = batch_data
labels, bg = labels.to(args['device']), bg.to(args['device'])
prediction = model(bg)
loss = loss_criterion(prediction, (labels - args['train_mean']) / args['train_std'])
epoch_loss += loss.data.item() * len(indices)
optimizer.zero_grad()
loss.backward()
optimizer.step()
train_meter.update(prediction, labels)
avg_loss = epoch_loss / len(data_loader.dataset)
total_scores = {metric: train_meter.compute_metric(metric) for metric in args['metrics']}
msg = 'epoch {:d}/{:d}, training | loss {:.4f}'.format(
epoch + 1, args['num_epochs'], avg_loss)
msg = update_msg_from_scores(msg, total_scores)
print(msg)
def run_an_eval_epoch(args, model, data_loader):
model.eval()
eval_meter = Meter(args['train_mean'], args['train_std'])
with torch.no_grad():
for batch_id, batch_data in enumerate(data_loader):
indices, ligand_mols, protein_mols, bg, labels = batch_data
labels, bg = labels.to(args['device']), bg.to(args['device'])
prediction = model(bg)
eval_meter.update(prediction, labels)
total_scores = {metric: eval_meter.compute_metric(metric) for metric in args['metrics']}
return total_scores
def main(args):
args['device'] = torch.device("cuda") if torch.cuda.is_available() else torch.device("cpu")
set_random_seed(args['random_seed'])
dataset, train_set, test_set = load_dataset(args)
args['train_mean'] = train_set.labels_mean.to(args['device'])
args['train_std'] = train_set.labels_std.to(args['device'])
    train_loader = DataLoader(dataset=train_set,
                              batch_size=args['batch_size'],
                              shuffle=args['shuffle'],
                              collate_fn=collate)
    test_loader = DataLoader(dataset=test_set,
                             batch_size=args['batch_size'],
                             # No need to shuffle the data for evaluation
                             shuffle=False,
                             collate_fn=collate)
model = load_model(args)
loss_fn = nn.MSELoss()
optimizer = torch.optim.Adam(model.parameters(), lr=args['lr'])
model.to(args['device'])
for epoch in range(args['num_epochs']):
run_a_train_epoch(args, epoch, model, train_loader, loss_fn, optimizer)
test_scores = run_an_eval_epoch(args, model, test_loader)
test_msg = update_msg_from_scores('test results', test_scores)
print(test_msg)
if __name__ == '__main__':
import argparse
from configure import get_exp_configure
parser = argparse.ArgumentParser(description='Protein-Ligand Binding Affinity Prediction')
parser.add_argument('-m', '--model', type=str, choices=['ACNN'],
help='Model to use')
parser.add_argument('-d', '--dataset', type=str,
choices=['PDBBind_core_pocket_random', 'PDBBind_core_pocket_scaffold',
'PDBBind_core_pocket_stratified', 'PDBBind_core_pocket_temporal',
'PDBBind_refined_pocket_random', 'PDBBind_refined_pocket_scaffold',
'PDBBind_refined_pocket_stratified', 'PDBBind_refined_pocket_temporal'],
help='Dataset to use')
args = parser.parse_args().__dict__
args['exp'] = '_'.join([args['model'], args['dataset']])
args.update(get_exp_configure(args['exp']))
main(args)
import dgl
import numpy as np
import random
import torch
import torch.nn.functional as F
from dgl import model_zoo
from dgl.data.chem import PDBBind, RandomSplitter, ScaffoldSplitter, SingleTaskStratifiedSplitter
from dgl.data.utils import Subset
from itertools import accumulate
from scipy.stats import pearsonr
def set_random_seed(seed=0):
"""Set random seed.
Parameters
----------
seed : int
Random seed to use. Default to 0.
"""
torch.backends.cudnn.deterministic = True
torch.backends.cudnn.benchmark = False
random.seed(seed)
np.random.seed(seed)
torch.manual_seed(seed)
if torch.cuda.is_available():
torch.cuda.manual_seed(seed)
def load_dataset(args):
"""Load the dataset.
Parameters
----------
args : dict
Input arguments.
Returns
-------
dataset
Full dataset.
train_set
Train subset of the dataset.
    test_set
        Test subset of the dataset.
"""
assert args['dataset'] in ['PDBBind'], 'Unexpected dataset {}'.format(args['dataset'])
if args['dataset'] == 'PDBBind':
dataset = PDBBind(subset=args['subset'],
load_binding_pocket=args['load_binding_pocket'],
zero_padding=True)
# No validation set is used and frac_val = 0.
if args['split'] == 'random':
train_set, _, test_set = RandomSplitter.train_val_test_split(
dataset,
frac_train=args['frac_train'],
frac_val=args['frac_val'],
frac_test=args['frac_test'],
random_state=args['random_seed'])
elif args['split'] == 'scaffold':
train_set, _, test_set = ScaffoldSplitter.train_val_test_split(
dataset,
mols=dataset.ligand_mols,
sanitize=False,
frac_train=args['frac_train'],
frac_val=args['frac_val'],
frac_test=args['frac_test'])
elif args['split'] == 'stratified':
train_set, _, test_set = SingleTaskStratifiedSplitter.train_val_test_split(
dataset,
labels=dataset.labels,
task_id=0,
frac_train=args['frac_train'],
frac_val=args['frac_val'],
frac_test=args['frac_test'],
random_state=args['random_seed'])
elif args['split'] == 'temporal':
years = dataset.df['release_year'].values.astype(np.float32)
indices = np.argsort(years).tolist()
frac_list = np.array([args['frac_train'], args['frac_val'], args['frac_test']])
num_data = len(dataset)
lengths = (num_data * frac_list).astype(int)
lengths[-1] = num_data - np.sum(lengths[:-1])
train_set, val_set, test_set = [
Subset(dataset, list(indices[offset - length:offset]))
for offset, length in zip(accumulate(lengths), lengths)]
    else:
        raise ValueError('Expect the splitting method to be "random", "scaffold", '
                         '"stratified" or "temporal", got {}'.format(args['split']))
train_labels = torch.stack([train_set.dataset.labels[i] for i in train_set.indices])
train_set.labels_mean = train_labels.mean(dim=0)
train_set.labels_std = train_labels.std(dim=0)
return dataset, train_set, test_set
def collate(data):
indices, ligand_mols, protein_mols, graphs, labels = map(list, zip(*data))
bg = dgl.batch_hetero(graphs)
for nty in bg.ntypes:
bg.set_n_initializer(dgl.init.zero_initializer, ntype=nty)
for ety in bg.canonical_etypes:
bg.set_e_initializer(dgl.init.zero_initializer, etype=ety)
labels = torch.stack(labels, dim=0)
return indices, ligand_mols, protein_mols, bg, labels
def load_model(args):
assert args['model'] in ['ACNN'], 'Unexpected model {}'.format(args['model'])
if args['model'] == 'ACNN':
model = model_zoo.chem.ACNN(hidden_sizes=args['hidden_sizes'],
weight_init_stddevs=args['weight_init_stddevs'],
dropouts=args['dropouts'],
features_to_use=args['atomic_numbers_considered'],
radial=args['radial'])
return model
class Meter(object):
"""Track and summarize model performance on a dataset for (multi-label) prediction.
Parameters
----------
    mean : torch.float32 tensor of shape (T)
        Mean of existing training labels across tasks, T for the number of tasks
    std : torch.float32 tensor of shape (T)
        Std of existing training labels across tasks, T for the number of tasks
"""
def __init__(self, mean=None, std=None):
self.y_pred = []
self.y_true = []
        if (mean is not None) and (std is not None):
self.mean = mean.cpu()
self.std = std.cpu()
else:
self.mean = None
self.std = None
def update(self, y_pred, y_true):
"""Update for the result of an iteration
Parameters
----------
y_pred : float32 tensor
Predicted molecule labels with shape (B, T),
B for batch size and T for the number of tasks
y_true : float32 tensor
Ground truth molecule labels with shape (B, T)
"""
self.y_pred.append(y_pred.detach().cpu())
self.y_true.append(y_true.detach().cpu())
def _finalize_labels_and_prediction(self):
"""Concatenate the labels and predictions.
If normalization was performed on the labels, undo the normalization.
"""
y_pred = torch.cat(self.y_pred, dim=0)
y_true = torch.cat(self.y_true, dim=0)
if (self.mean is not None) and (self.std is not None):
            # During training, the ground truth labels were normalized with the
            # training set mean and std to ease optimization.
            # We need to undo that normalization for evaluation.
y_pred = y_pred * self.std + self.mean
return y_pred, y_true
def pearson_r2(self):
"""Compute squared Pearson correlation coefficient
Returns
-------
float
"""
y_pred, y_true = self._finalize_labels_and_prediction()
return pearsonr(y_true[:, 0].numpy(), y_pred[:, 0].numpy())[0] ** 2
def mae(self):
"""Compute MAE
Returns
-------
float
"""
y_pred, y_true = self._finalize_labels_and_prediction()
return F.l1_loss(y_true, y_pred).data.item()
def rmse(self):
"""
Compute RMSE
Returns
-------
float
"""
y_pred, y_true = self._finalize_labels_and_prediction()
return np.sqrt(F.mse_loss(y_pred, y_true).cpu().item())
def compute_metric(self, metric_name):
"""Compute metric
Parameters
----------
metric_name : str
Name for the metric to compute.
Returns
-------
float
Metric value
"""
assert metric_name in ['pearson_r2', 'mae', 'rmse'], \
'Expect metric name to be "pearson_r2", "mae" or "rmse", got {}'.format(metric_name)
if metric_name == 'pearson_r2':
return self.pearson_r2()
if metric_name == 'mae':
return self.mae()
if metric_name == 'rmse':
return self.rmse()
......@@ -44,8 +44,8 @@ def run_an_eval_epoch(args, model, data_loader):
return np.mean(eval_meter.compute_metric(args['metric_name']))
def main(args):
args['device'] = torch.device("cuda") if torch.cuda.is_available() else torch.device("cpu")
set_random_seed(args['random_seed'])
# Interchangeable with other datasets
dataset, train_set, val_set, test_set = load_dataset_for_classification(args)
......
......@@ -6,11 +6,14 @@ from functools import partial
from utils import chirality
GCN_Tox21 = {
'random_seed': 0,
'batch_size': 128,
'lr': 1e-3,
'num_epochs': 100,
'atom_data_field': 'h',
'frac_train': 0.8,
'frac_val': 0.1,
'frac_test': 0.1,
'in_feats': 74,
'gcn_hidden_feats': [64, 64],
'classifier_hidden_feats': 64,
......@@ -20,11 +23,14 @@ GCN_Tox21 = {
}
GAT_Tox21 = {
'random_seed': 0,
'batch_size': 128,
'lr': 1e-3,
'num_epochs': 100,
'atom_data_field': 'h',
'frac_train': 0.8,
'frac_val': 0.1,
'frac_test': 0.1,
'in_feats': 74,
'gat_hidden_feats': [32, 32],
'classifier_hidden_feats': 64,
......@@ -35,6 +41,7 @@ GAT_Tox21 = {
}
MPNN_Alchemy = {
'random_seed': 0,
'batch_size': 16,
'num_epochs': 250,
'node_in_feats': 15,
......@@ -47,6 +54,7 @@ MPNN_Alchemy = {
}
SCHNET_Alchemy = {
'random_seed': 0,
'batch_size': 16,
'num_epochs': 250,
'norm': True,
......@@ -58,6 +66,7 @@ SCHNET_Alchemy = {
}
MGCN_Alchemy = {
'random_seed': 0,
'batch_size': 16,
'num_epochs': 250,
'norm': True,
......@@ -81,7 +90,9 @@ AttentiveFP_Aromaticity = {
'lr': 10 ** (-2.5),
'batch_size': 128,
'num_epochs': 800,
'frac_train': 0.8,
'frac_val': 0.1,
'frac_test': 0.1,
'patience': 80,
'metric_name': 'rmse',
# Follow the atom featurization in the original work
......
......@@ -55,8 +55,8 @@ def run_an_eval_epoch(args, model, data_loader):
return total_score
def main(args):
args['device'] = torch.device("cuda") if torch.cuda.is_available() else torch.device("cpu")
set_random_seed(args['random_seed'])
train_set, val_set, test_set = load_dataset_for_regression(args)
train_loader = DataLoader(dataset=train_set,
......
......@@ -6,8 +6,7 @@ import torch
import torch.nn.functional as F
from dgl import model_zoo
from dgl.data.chem import smiles_to_bigraph, one_hot_encoding, RandomSplitter
from sklearn.metrics import roc_auc_score
def set_random_seed(seed=0):
......@@ -278,8 +277,10 @@ def load_dataset_for_classification(args):
assert args['dataset'] in ['Tox21']
if args['dataset'] == 'Tox21':
from dgl.data.chem import Tox21
dataset = Tox21(smiles_to_bigraph, args['atom_featurizer'])
train_set, val_set, test_set = RandomSplitter.train_val_test_split(
dataset, frac_train=args['frac_train'], frac_val=args['frac_val'],
frac_test=args['frac_test'], random_state=args['random_seed'])
return dataset, train_set, val_set, test_set
......@@ -310,10 +311,12 @@ def load_dataset_for_regression(args):
if args['dataset'] == 'Aromaticity':
from dgl.data.chem import PubChemBioAssayAromaticity
dataset = PubChemBioAssayAromaticity(smiles_to_bigraph,
args['atom_featurizer'],
args['bond_featurizer'])
train_set, val_set, test_set = RandomSplitter.train_val_test_split(
dataset, frac_train=args['frac_train'], frac_val=args['frac_val'],
frac_test=args['frac_test'], random_state=args['random_seed'])
return train_set, val_set, test_set
......
from .datasets import *
from .utils import *
from .csv_dataset import MoleculeCSVDataset
from .tox21 import Tox21
from .alchemy import TencentAlchemyDataset
from .pubchem_aromaticity import PubChemBioAssayAromaticity
from .pdbbind import PDBBind
......@@ -6,14 +6,13 @@ import numpy as np
import os
import os.path as osp
import pathlib
import pickle
import zipfile
from collections import defaultdict
from ..utils import mol_to_complete_graph, atom_type_one_hot, \
atom_hybridization_one_hot, atom_is_aromatic
from ...utils import download, get_download_dir, _get_dgl_url, save_graphs, load_graphs
from .... import backend as F
try:
import pandas as pd
......@@ -156,11 +155,12 @@ class TencentAlchemyDataset(object):
mol_to_graph: callable, str -> DGLGraph
A function turning an RDKit molecule instance into a DGLGraph.
Default to :func:`dgl.data.chem.mol_to_complete_graph`.
node_featurizer : callable, rdkit.Chem.rdchem.Mol -> dict
Featurization for nodes like atoms in a molecule, which can be used to update
ndata for a DGLGraph. By default, we construct graphs where nodes represent atoms
and node features represent atom features. We store the atomic numbers under the
name ``"node_type"`` and store the atom features under the name ``"n_feat"``.
The atom features include:
* One-hot encoding for atom types
* Atomic number of atoms
* Whether the atom is a donor
......@@ -168,16 +168,17 @@ class TencentAlchemyDataset(object):
* Whether the atom is aromatic
* One-hot encoding for atom hybridization
* Total number of Hs on the atom
bond_featurizer : callable, rdkit.Chem.rdchem.Mol -> dict
Featurization for bonds in a molecule, which can be used to update
edata for a DGLGraph. By default, we store the distance between the
end atoms under the name ``"distance"`` and store the bond features under
the name ``"e_feat"``. The bond features are one-hot encodings of the bond type.
edge_featurizer : callable, rdkit.Chem.rdchem.Mol -> dict
Featurization for edges like bonds in a molecule, which can be used to update
edata for a DGLGraph. By default, we construct edges between every pair of atoms,
excluding self loops. We store the distance between the end atoms under the name
``"distance"`` and store the edge features under the name ``"e_feat"``. The edge
features are one-hot encodings of edge types (bond types and non-bond edges).
"""
def __init__(self, mode='dev', from_raw=False,
mol_to_graph=mol_to_complete_graph,
atom_featurizer=alchemy_nodes,
bond_featurizer=alchemy_edges):
node_featurizer=alchemy_nodes,
edge_featurizer=alchemy_edges):
if mode == 'test':
raise ValueError('The test mode is not supported before '
'the Alchemy contest finishes.')
......@@ -205,9 +206,9 @@ class TencentAlchemyDataset(object):
archive.extractall(file_dir)
archive.close()
self._load(mol_to_graph, atom_featurizer, bond_featurizer)
self._load(mol_to_graph, node_featurizer, edge_featurizer)
def _load(self, mol_to_graph, atom_featurizer, bond_featurizer):
def _load(self, mol_to_graph, node_featurizer, edge_featurizer):
if not self.from_raw:
self.graphs, label_dict = load_graphs(osp.join(self.file_dir, "%s_graphs.bin" % self.mode))
self.labels = label_dict['labels']
......@@ -230,8 +231,8 @@ class TencentAlchemyDataset(object):
for mol, label in zip(supp, self.target.iterrows()):
cnt += 1
print('Processing molecule {:d}/{:d}'.format(cnt, dataset_size))
graph = mol_to_graph(mol, atom_featurizer=atom_featurizer,
bond_featurizer=bond_featurizer)
graph = mol_to_graph(mol, node_featurizer=node_featurizer,
edge_featurizer=edge_featurizer)
smiles = Chem.MolToSmiles(mol)
self.smiles.append(smiles)
self.graphs.append(graph)
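Featurizers such as ``atom_type_one_hot`` and ``atom_hybridization_one_hot`` imported above map each atom attribute to a one-hot vector over an allowable set. A minimal sketch of that encoding scheme (the helper below is illustrative, not DGL's exact implementation):

```python
def one_hot_encoding(value, allowable_set, encode_unknown=False):
    """Map a value to a one-hot list over an allowable set.

    If encode_unknown is True, an extra final slot marks values
    outside the allowable set.
    """
    if encode_unknown:
        allowable_set = list(allowable_set) + [None]
        if value not in allowable_set:
            value = None
    return [int(value == v) for v in allowable_set]

# Carbon over atom types C, N, O -> [1, 0, 0]
print(one_hot_encoding('C', ['C', 'N', 'O']))
```

Concatenating several such vectors per atom yields the ``"n_feat"`` node feature described in the docstring above.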
......
from __future__ import absolute_import
import dgl.backend as F
import numpy as np
import os
import sys
from ..utils import save_graphs, load_graphs
from ... import backend as F
from ...utils import save_graphs, load_graphs
from .... import backend as F
class MoleculeCSVDataset(object):
"""MoleculeCSVDataset
......@@ -27,21 +26,21 @@ class MoleculeCSVDataset(object):
Column names other than smiles column would be considered as task names.
smiles_to_graph: callable, str -> DGLGraph
A function turning a SMILES into a DGLGraph.
atom_featurizer : callable, rdkit.Chem.rdchem.Mol -> dict
Featurization for atoms in a molecule, which can be used to update
node_featurizer : callable, rdkit.Chem.rdchem.Mol -> dict
Featurization for nodes like atoms in a molecule, which can be used to update
ndata for a DGLGraph.
bond_featurizer : callable, rdkit.Chem.rdchem.Mol -> dict
Featurization for bonds in a molecule, which can be used to update
edge_featurizer : callable, rdkit.Chem.rdchem.Mol -> dict
Featurization for edges like bonds in a molecule, which can be used to update
edata for a DGLGraph.
smiles_column: str
Column name containing the SMILES strings.
cache_file_path: str
Path to store the preprocessed data.
"""
def __init__(self, df, smiles_to_graph, atom_featurizer, bond_featurizer,
def __init__(self, df, smiles_to_graph, node_featurizer, edge_featurizer,
smiles_column, cache_file_path):
if 'rdkit' not in sys.modules:
from ...base import dgl_warning
from ....base import dgl_warning
dgl_warning(
"Please install RDKit (Recommended Version is 2018.09.3)")
self.df = df
......@@ -49,9 +48,9 @@ class MoleculeCSVDataset(object):
self.task_names = self.df.columns.drop([smiles_column]).tolist()
self.n_tasks = len(self.task_names)
self.cache_file_path = cache_file_path
self._pre_process(smiles_to_graph, atom_featurizer, bond_featurizer)
self._pre_process(smiles_to_graph, node_featurizer, edge_featurizer)
def _pre_process(self, smiles_to_graph, atom_featurizer, bond_featurizer):
def _pre_process(self, smiles_to_graph, node_featurizer, edge_featurizer):
"""Pre-process the dataset
* Convert molecules from smiles format into DGLGraphs
......@@ -63,11 +62,11 @@ class MoleculeCSVDataset(object):
----------
smiles_to_graph : callable, SMILES -> DGLGraph
Function for converting a SMILES (str) into a DGLGraph.
atom_featurizer : callable, rdkit.Chem.rdchem.Mol -> dict
Featurization for atoms in a molecule, which can be used to update
node_featurizer : callable, rdkit.Chem.rdchem.Mol -> dict
Featurization for nodes like atoms in a molecule, which can be used to update
ndata for a DGLGraph.
bond_featurizer : callable, rdkit.Chem.rdchem.Mol -> dict
Featurization for bonds in a molecule, which can be used to update
edge_featurizer : callable, rdkit.Chem.rdchem.Mol -> dict
Featurization for edges like bonds in a molecule, which can be used to update
edata for a DGLGraph.
"""
if os.path.exists(self.cache_file_path):
......@@ -81,8 +80,8 @@ class MoleculeCSVDataset(object):
self.graphs = []
for i, s in enumerate(self.smiles):
print('Processing molecule {:d}/{:d}'.format(i+1, len(self)))
self.graphs.append(smiles_to_graph(s, atom_featurizer=atom_featurizer,
bond_featurizer=bond_featurizer))
self.graphs.append(smiles_to_graph(s, node_featurizer=node_featurizer,
edge_featurizer=edge_featurizer))
_label_values = self.df[self.task_names].values
# np.nan_to_num will also turn inf into a very large number
self.labels = F.zerocopy_from_numpy(np.nan_to_num(_label_values).astype(np.float32))
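The label preparation above replaces missing values before converting the array to a framework tensor. A small NumPy-only sketch of that step, with made-up label values:

```python
import numpy as np

# Multi-task labels read from a CSV, with missing entries as NaN
label_values = np.array([[0.5, np.nan],
                         [np.nan, 1.0]])

# np.nan_to_num maps nan -> 0.0 (and would also map inf
# to a very large finite number)
labels = np.nan_to_num(label_values).astype(np.float32)
print(labels)
```

Tasks with missing labels are typically masked out later, so the zero fill only serves to keep the tensor dense.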
......
"""PDBBind dataset processed by MoleculeNet."""
import numpy as np
import os
import pandas as pd
from ..utils import multiprocess_load_molecules, ACNN_graph_construction_and_featurization
from ...utils import get_download_dir, download, _get_dgl_url, extract_archive
from .... import backend as F
class PDBBind(object):
"""PDBbind dataset processed by MoleculeNet.
The description below is mainly based on
`[1] <https://pubs.rsc.org/en/content/articlelanding/2018/sc/c7sc02664a#cit50>`__.
The PDBbind database consists of experimentally measured binding affinities for
bio-molecular complexes `[2] <https://www.ncbi.nlm.nih.gov/pubmed/?term=15163179%5Buid%5D>`__,
`[3] <https://www.ncbi.nlm.nih.gov/pubmed/?term=15943484%5Buid%5D>`__. It provides detailed
3D Cartesian coordinates of both ligands and their target proteins derived from experimental
(e.g., X-ray crystallography) measurements. The availability of coordinates of the
protein-ligand complexes permits structure-based featurization that is aware of the
protein-ligand binding geometry. The authors of
`[1] <https://pubs.rsc.org/en/content/articlelanding/2018/sc/c7sc02664a#cit50>`__ use the
"refined" and "core" subsets of the database
`[4] <https://www.ncbi.nlm.nih.gov/pubmed/?term=25301850%5Buid%5D>`__, more carefully
processed for data artifacts, as additional benchmarking targets.
References:
* [1] MoleculeNet: a benchmark for molecular machine learning
* [2] The PDBbind database: collection of binding affinities for protein-ligand complexes
with known three-dimensional structures
* [3] The PDBbind database: methodologies and updates
* [4] PDB-wide collection of binding data: current status of the PDBbind database
Parameters
----------
subset : str
In MoleculeNet, we can use either the "refined" subset or the "core" subset. We can
retrieve them by setting ``subset`` to be ``'refined'`` or ``'core'``. The size
of the ``'core'`` set is 195 and the size of the ``'refined'`` set is 3706.
load_binding_pocket : bool
Whether to load binding pockets or full proteins. Default to True.
add_hydrogens : bool
Whether to add hydrogens via pdbfixer. Default to False.
sanitize : bool
Whether sanitization is performed in initializing RDKit molecule instances. See
https://www.rdkit.org/docs/RDKit_Book.html for details of the sanitization.
Default to False.
calc_charges : bool
Whether to add Gasteiger charges via RDKit. Setting this to True will force
``add_hydrogens`` and ``sanitize`` to be True. Default to False.
remove_hs : bool
Whether to remove hydrogens via RDKit. Note that removing hydrogens can be quite
slow for large molecules. Default to False.
use_conformation : bool
Whether we need to extract molecular conformation from proteins and ligands.
Default to True.
construct_graph_and_featurize : callable
Construct a DGLHeteroGraph for the use of GNNs. Mapping self.ligand_mols[i],
self.protein_mols[i], self.ligand_coordinates[i] and self.protein_coordinates[i]
to a DGLHeteroGraph. Default to :func:`ACNN_graph_construction_and_featurization`.
zero_padding : bool
Whether to perform zero padding. While DGL does not necessarily require zero padding,
pooling operations for variable length inputs can introduce stochastic behaviour, which
is not desired for sensitive scenarios. Default to True.
num_processes : int or None
Number of worker processes to use. If None,
then we will use the number of CPUs in the system. Default to 64.
"""
def __init__(self, subset, load_binding_pocket=True, add_hydrogens=False,
sanitize=False, calc_charges=False, remove_hs=False, use_conformation=True,
construct_graph_and_featurize=ACNN_graph_construction_and_featurization,
zero_padding=True, num_processes=64):
self.task_names = ['-logKd/Ki']
self.n_tasks = len(self.task_names)
self._url = 'dataset/pdbbind_v2015.tar.gz'
root_dir_path = get_download_dir()
data_path = root_dir_path + '/pdbbind_v2015.tar.gz'
extracted_data_path = root_dir_path + '/pdbbind_v2015'
download(_get_dgl_url(self._url), path=data_path)
extract_archive(data_path, extracted_data_path)
if subset == 'core':
index_label_file = extracted_data_path + '/v2015/INDEX_core_data.2013'
elif subset == 'refined':
index_label_file = extracted_data_path + '/v2015/INDEX_refined_data.2015'
else:
raise ValueError(
'Expect subset to be either '
'core or refined, got {}'.format(subset))
self._preprocess(extracted_data_path, index_label_file, load_binding_pocket,
add_hydrogens, sanitize, calc_charges, remove_hs, use_conformation,
construct_graph_and_featurize, zero_padding, num_processes)
def _filter_out_invalid(self, ligands_loaded, proteins_loaded, use_conformation):
"""Filter out invalid ligand-protein pairs.
Parameters
----------
ligands_loaded : list
Each element is a 2-tuple of the RDKit molecule instance and its associated atom
coordinates. None is used to represent invalid/non-existing molecule or coordinates.
proteins_loaded : list
Each element is a 2-tuple of the RDKit molecule instance and its associated atom
coordinates. None is used to represent invalid/non-existing molecule or coordinates.
use_conformation : bool
Whether we need conformation information (atom coordinates) and filter out molecules
without valid conformation.
"""
num_pairs = len(proteins_loaded)
self.indices, self.ligand_mols, self.protein_mols = [], [], []
if use_conformation:
self.ligand_coordinates, self.protein_coordinates = [], []
else:
# Use None for placeholders.
self.ligand_coordinates = [None for _ in range(num_pairs)]
self.protein_coordinates = [None for _ in range(num_pairs)]
for i in range(num_pairs):
ligand_mol, ligand_coordinates = ligands_loaded[i]
protein_mol, protein_coordinates = proteins_loaded[i]
if (not use_conformation) and all(v is not None for v in [protein_mol, ligand_mol]):
self.indices.append(i)
self.ligand_mols.append(ligand_mol)
self.protein_mols.append(protein_mol)
elif all(v is not None for v in [
protein_mol, protein_coordinates, ligand_mol, ligand_coordinates]):
self.indices.append(i)
self.ligand_mols.append(ligand_mol)
self.ligand_coordinates.append(ligand_coordinates)
self.protein_mols.append(protein_mol)
self.protein_coordinates.append(protein_coordinates)
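``_filter_out_invalid`` above keeps only indices where both molecules (and, when conformations are required, both coordinate arrays) loaded successfully. A standalone sketch of the same rule, using placeholder strings instead of RDKit molecules:

```python
def filter_valid_pairs(ligands_loaded, proteins_loaded, use_conformation):
    """Return indices of pairs where every required item is not None."""
    kept = []
    for i, ((lig_mol, lig_xyz), (prot_mol, prot_xyz)) in enumerate(
            zip(ligands_loaded, proteins_loaded)):
        required = [lig_mol, prot_mol]
        if use_conformation:
            # Coordinates are only required when conformations are used
            required += [lig_xyz, prot_xyz]
        if all(v is not None for v in required):
            kept.append(i)
    return kept

# Second ligand loaded, but its conformation failed to parse
ligands = [('mol_a', 'xyz_a'), ('mol_b', None)]
proteins = [('mol_p', 'xyz_p'), ('mol_q', 'xyz_q')]
```

With ``use_conformation=True`` only the first pair survives; with ``False`` both do.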
def _preprocess(self, root_path, index_label_file, load_binding_pocket,
add_hydrogens, sanitize, calc_charges, remove_hs, use_conformation,
construct_graph_and_featurize, zero_padding, num_processes):
"""Preprocess the dataset.
The pre-processing proceeds as follows:
1. Load the dataset
2. Clean the dataset and filter out invalid pairs
3. Construct graphs
4. Prepare node and edge features
Parameters
----------
root_path : str
Root path for molecule files.
index_label_file : str
Path to the index file for the dataset.
load_binding_pocket : bool
Whether to load binding pockets or full proteins.
add_hydrogens : bool
Whether to add hydrogens via pdbfixer.
sanitize : bool
Whether sanitization is performed in initializing RDKit molecule instances. See
https://www.rdkit.org/docs/RDKit_Book.html for details of the sanitization.
calc_charges : bool
Whether to add Gasteiger charges via RDKit. Setting this to True will force
``add_hydrogens`` and ``sanitize`` to be True.
remove_hs : bool
Whether to remove hydrogens via RDKit. Note that removing hydrogens can be quite
slow for large molecules.
use_conformation : bool
Whether we need to extract molecular conformation from proteins and ligands.
construct_graph_and_featurize : callable
Construct a DGLHeteroGraph for the use of GNNs. Mapping self.ligand_mols[i],
self.protein_mols[i], self.ligand_coordinates[i] and self.protein_coordinates[i]
to a DGLHeteroGraph. Default to :func:`ACNN_graph_construction_and_featurization`.
zero_padding : bool
Whether to perform zero padding. While DGL does not necessarily require zero padding,
pooling operations for variable length inputs can introduce stochastic behaviour, which
is not desired for sensitive scenarios.
num_processes : int or None
Number of worker processes to use. If None,
then we will use the number of CPUs in the system.
"""
contents = []
with open(index_label_file, 'r') as f:
for line in f.readlines():
if line[0] != "#":
splitted_elements = line.split()
if len(splitted_elements) == 8:
# Ignore "//"
contents.append(splitted_elements[:5] + splitted_elements[6:])
else:
print('Incorrect data format.')
print(splitted_elements)
self.df = pd.DataFrame(contents, columns=(
'PDB_code', 'resolution', 'release_year',
'-logKd/Ki', 'Kd/Ki', 'reference', 'ligand_name'))
pdbs = self.df['PDB_code'].tolist()
self.ligand_files = [os.path.join(
root_path, 'v2015', pdb, '{}_ligand.sdf'.format(pdb)) for pdb in pdbs]
if load_binding_pocket:
self.protein_files = [os.path.join(
root_path, 'v2015', pdb, '{}_pocket.pdb'.format(pdb)) for pdb in pdbs]
else:
self.protein_files = [os.path.join(
root_path, 'v2015', pdb, '{}_protein.pdb'.format(pdb)) for pdb in pdbs]
num_processes = min(num_processes, len(pdbs))
print('Loading ligands...')
ligands_loaded = multiprocess_load_molecules(self.ligand_files,
add_hydrogens=add_hydrogens,
sanitize=sanitize,
calc_charges=calc_charges,
remove_hs=remove_hs,
use_conformation=use_conformation,
num_processes=num_processes)
print('Loading proteins...')
proteins_loaded = multiprocess_load_molecules(self.protein_files,
add_hydrogens=add_hydrogens,
sanitize=sanitize,
calc_charges=calc_charges,
remove_hs=remove_hs,
use_conformation=use_conformation,
num_processes=num_processes)
self._filter_out_invalid(ligands_loaded, proteins_loaded, use_conformation)
self.df = self.df.iloc[self.indices]
self.labels = F.zerocopy_from_numpy(self.df[self.task_names].values.astype(np.float32))
print('Finished cleaning the dataset, '
'got {:d}/{:d} valid pairs'.format(len(self), len(pdbs)))
# Prepare zero padding
if zero_padding:
max_num_ligand_atoms = 0
max_num_protein_atoms = 0
for i in range(len(self)):
max_num_ligand_atoms = max(
max_num_ligand_atoms, self.ligand_mols[i].GetNumAtoms())
max_num_protein_atoms = max(
max_num_protein_atoms, self.protein_mols[i].GetNumAtoms())
else:
max_num_ligand_atoms = None
max_num_protein_atoms = None
print('Start constructing graphs and featurizing them.')
self.graphs = []
for i in range(len(self)):
print('Constructing and featurizing datapoint {:d}/{:d}'.format(i+1, len(self)))
self.graphs.append(construct_graph_and_featurize(
self.ligand_mols[i], self.protein_mols[i],
self.ligand_coordinates[i], self.protein_coordinates[i],
max_num_ligand_atoms, max_num_protein_atoms))
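The index-file parsing near the top of ``_preprocess`` keeps non-comment lines with exactly eight whitespace-separated fields and drops the ``//`` separator before the reference. A self-contained sketch (the sample line is illustrative, not taken from the real index file):

```python
COLUMNS = ('PDB_code', 'resolution', 'release_year',
           '-logKd/Ki', 'Kd/Ki', 'reference', 'ligand_name')

def parse_index_line(line):
    """Parse one PDBBind index line into a dict, or return None."""
    if line.startswith('#'):
        return None  # comment line
    fields = line.split()
    if len(fields) != 8:
        return None  # unexpected format
    # Drop the '//' separator between the data and the reference
    fields = fields[:5] + fields[6:]
    return dict(zip(COLUMNS, fields))

sample = '3abc  2.20  2012  4.00  Ki=100uM  //  3abc.pdf  (LIG)'
print(parse_index_line(sample))
```

Collecting the resulting dicts into a ``pandas.DataFrame`` reproduces the ``self.df`` construction above.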
def __len__(self):
"""Get the size of the dataset.
Returns
-------
int
Number of valid ligand-protein pairs in the dataset.
"""
return len(self.indices)
def __getitem__(self, item):
"""Get the datapoint associated with the index.
Parameters
----------
item : int
Index for the datapoint.
Returns
-------
int
Index for the datapoint.
rdkit.Chem.rdchem.Mol
RDKit molecule instance for the ligand molecule.
rdkit.Chem.rdchem.Mol
RDKit molecule instance for the protein molecule.
DGLHeteroGraph
Pre-processed DGLHeteroGraph with features extracted.
Float32 tensor
Label for the datapoint.
"""
return item, self.ligand_mols[item], self.protein_mols[item], \
self.graphs[item], self.labels[item]
......@@ -2,8 +2,8 @@ import pandas as pd
import sys
from .csv_dataset import MoleculeCSVDataset
from .utils import smiles_to_bigraph
from ..utils import get_download_dir, download, _get_dgl_url
from ..utils import smiles_to_bigraph
from ...utils import get_download_dir, download, _get_dgl_url
class PubChemBioAssayAromaticity(MoleculeCSVDataset):
"""Subset of PubChem BioAssay Dataset for aromaticity prediction.
......@@ -15,12 +15,24 @@ class PubChemBioAssayAromaticity(MoleculeCSVDataset):
The dataset was constructed by sampling 3945 molecules with 0-40 aromatic atoms from the
PubChem BioAssay dataset.
Parameters
----------
smiles_to_graph: callable, str -> DGLGraph
A function turning a SMILES string into a DGLGraph.
Default to :func:`dgl.data.chem.smiles_to_bigraph`.
node_featurizer : callable, rdkit.Chem.rdchem.Mol -> dict
Featurization for nodes like atoms in a molecule, which can be used to update
ndata for a DGLGraph. Default to None.
edge_featurizer : callable, rdkit.Chem.rdchem.Mol -> dict
Featurization for edges like bonds in a molecule, which can be used to update
edata for a DGLGraph. Default to None.
"""
def __init__(self, smiles_to_graph=smiles_to_bigraph,
atom_featurizer=None,
bond_featurizer=None):
node_featurizer=None,
edge_featurizer=None):
if 'pandas' not in sys.modules:
from ...base import dgl_warning
from ....base import dgl_warning
dgl_warning("Please install pandas")
self._url = 'dataset/pubchem_bioassay_aromaticity.csv'
......@@ -28,5 +40,5 @@ class PubChemBioAssayAromaticity(MoleculeCSVDataset):
download(_get_dgl_url(self._url), path=data_path)
df = pd.read_csv(data_path)
super(PubChemBioAssayAromaticity, self).__init__(df, smiles_to_graph, atom_featurizer, bond_featurizer,
super(PubChemBioAssayAromaticity, self).__init__(df, smiles_to_graph, node_featurizer, edge_featurizer,
"cano_smiles", "pubchem_aromaticity_dglgraph.bin")
import sys
from .csv_dataset import MoleculeCSVDataset
from .utils import smiles_to_bigraph
from ..utils import get_download_dir, download, _get_dgl_url
from ... import backend as F
from ..utils import smiles_to_bigraph
from ...utils import get_download_dir, download, _get_dgl_url
from .... import backend as F
try:
import pandas as pd
......@@ -32,18 +32,18 @@ class Tox21(MoleculeCSVDataset):
smiles_to_graph: callable, str -> DGLGraph
A function turning a SMILES string into a DGLGraph.
Default to :func:`dgl.data.chem.smiles_to_bigraph`.
atom_featurizer : callable, rdkit.Chem.rdchem.Mol -> dict
Featurization for atoms in a molecule, which can be used to update
node_featurizer : callable, rdkit.Chem.rdchem.Mol -> dict
Featurization for nodes like atoms in a molecule, which can be used to update
ndata for a DGLGraph. Default to None.
bond_featurizer : callable, rdkit.Chem.rdchem.Mol -> dict
Featurization for bonds in a molecule, which can be used to update
edge_featurizer : callable, rdkit.Chem.rdchem.Mol -> dict
Featurization for edges like bonds in a molecule, which can be used to update
edata for a DGLGraph. Default to None.
"""
def __init__(self, smiles_to_graph=smiles_to_bigraph,
atom_featurizer=None,
bond_featurizer=None):
node_featurizer=None,
edge_featurizer=None):
if 'pandas' not in sys.modules:
from ...base import dgl_warning
from ....base import dgl_warning
dgl_warning("Please install pandas")
self._url = 'dataset/tox21.csv.gz'
......@@ -54,7 +54,7 @@ class Tox21(MoleculeCSVDataset):
df = df.drop(columns=['mol_id'])
super(Tox21, self).__init__(df, smiles_to_graph, atom_featurizer, bond_featurizer,
super(Tox21, self).__init__(df, smiles_to_graph, node_featurizer, edge_featurizer,
"smiles", "tox21_dglgraph.bin")
self._weight_balancing()
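``_weight_balancing`` itself is not shown in this diff; a common scheme for imbalanced multi-task binary datasets like Tox21 weights positive samples of each task by the negative/positive ratio. A hypothetical sketch (not necessarily DGL's exact formula):

```python
import numpy as np

def positive_task_weights(labels):
    """Per-task weight for positive samples: n_negative / n_positive."""
    labels = np.asarray(labels, dtype=np.float32)
    n_pos = labels.sum(axis=0)
    n_neg = labels.shape[0] - n_pos
    # Guard against tasks with no positives
    return n_neg / np.maximum(n_pos, 1.0)

# task 0: 2 positives / 2 negatives, task 1: 1 positive / 3 negatives
labels = np.array([[1, 0], [0, 0], [1, 1], [0, 0]])
print(positive_task_weights(labels))
```

Such weights can be passed to a loss like ``BCEWithLogitsLoss(pos_weight=...)`` so rare positives contribute proportionally more.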
......
from .splitters import *
from .featurizers import *
from .mol_to_graph import *
from .complex_to_graph import *
from .rdkit_utils import *