Unverified commit cf9ba90f authored by Mufei Li, committed by GitHub

[Chem] ACNN and various utilities (#1117)

* Add several splitting methods

parent 6beef85b
......@@ -142,8 +142,22 @@ Molecular Graphs
To work on molecular graphs, make sure you have installed `RDKit 2018.09.3 <https://www.rdkit.org/docs/Install.html>`__.
Data Loading and Processing Utils
`````````````````````````````````
We adapt several utilities for processing molecules from
`DeepChem <https://github.com/deepchem/deepchem/blob/master/deepchem>`__.
.. autosummary::
    :toctree: ../../generated/

    chem.add_hydrogens_to_mol
    chem.get_mol_3D_coordinates
    chem.load_molecule
    chem.multiprocess_load_molecules
Featurization Utils for Single Molecule
```````````````````````````````````````
To apply graph neural networks, we need to featurize the nodes (atoms) and edges (bonds) of molecular graphs.
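As a tiny illustration of what atom featurization often boils down to, here is a one-hot encoding sketch. The helper name mirrors `dgl.data.chem.one_hot_encoding`, but treat the signature and behavior as assumptions for illustration, not the actual API:

```python
def one_hot_encoding(x, allowable_set):
    """Map x to a binary list over allowable_set; all zeros if x is unseen."""
    return [int(x == s) for s in allowable_set]

# Featurize a carbon atom against a small atom-type vocabulary
feat = one_hot_encoding('C', ['C', 'N', 'O'])
```

Node featurizers typically concatenate several such encodings (atom type, hybridization, aromaticity) into one feature vector per atom.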
......@@ -202,8 +216,8 @@ Utils for bond featurization:
chem.BaseBondFeaturizer.__call__
chem.CanonicalBondFeaturizer
Graph Construction for Single Molecule
``````````````````````````````````````
Several methods for constructing DGLGraphs from SMILES/RDKit molecule objects are listed below:
......@@ -215,6 +229,17 @@ Several methods for constructing DGLGraphs from SMILES/RDKit molecule objects ar
chem.mol_to_bigraph
chem.smiles_to_complete_graph
chem.mol_to_complete_graph
chem.k_nearest_neighbors
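As a hedged sketch of what nearest-neighbor graph construction computes, in the spirit of `chem.k_nearest_neighbors` (the function name comes from the list above, but the signature and return values here are illustrative, not the actual DGL API):

```python
# Connect each atom to at most max_num_neighbors other atoms within a
# distance cutoff, given 3D coordinates (illustrative sketch).
def k_nearest_neighbors(coords, neighbor_cutoff, max_num_neighbors):
    srcs, dsts = [], []
    for i in range(len(coords)):
        # Squared distances from atom i to every other atom, nearest first
        dists = sorted(
            (sum((coords[i][k] - coords[j][k]) ** 2 for k in range(3)), j)
            for j in range(len(coords)) if j != i)
        for d2, j in dists[:max_num_neighbors]:
            if d2 <= neighbor_cutoff ** 2:
                srcs.append(j)
                dsts.append(i)
    return srcs, dsts

# Four atoms on a line, 1.5 distance cutoff, at most 2 neighbors per atom
coords = [(0., 0., 0.), (1., 0., 0.), (2., 0., 0.), (3., 0., 0.)]
src, dst = k_nearest_neighbors(coords, neighbor_cutoff=1.5, max_num_neighbors=2)
```

Here only adjacent atoms fall within the cutoff, so each atom gets one or two incoming edges.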
Graph Construction and Featurization for Ligand-Protein Complex
```````````````````````````````````````````````````````````````
Constructing DGLHeteroGraphs and featurizing them.
.. autosummary::
    :toctree: ../../generated/

    chem.ACNN_graph_construction_and_featurization
Dataset Classes
```````````````
......@@ -224,11 +249,12 @@ If your dataset is stored in a ``.csv`` file, you may find it helpful to use
.. autoclass:: dgl.data.chem.CSVDataset
:members: __getitem__, __len__
Currently four datasets are supported:
* Tox21
* TencentAlchemyDataset
* PubChemBioAssayAromaticity
* PDBBind
.. autoclass:: dgl.data.chem.Tox21
:members: __getitem__, __len__, task_pos_weights
......@@ -238,3 +264,32 @@ Currently three datasets are supported:
.. autoclass:: dgl.data.chem.PubChemBioAssayAromaticity
:members: __getitem__, __len__
.. autoclass:: dgl.data.chem.PDBBind
:members: __getitem__, __len__
Dataset Splitting
`````````````````
We provide support for some common data splitting methods:
* consecutive split
* random split
* molecular weight split
* Bemis-Murcko scaffold split
* single-task-stratified split
.. autoclass:: dgl.data.chem.ConsecutiveSplitter
:members: train_val_test_split, k_fold_split
.. autoclass:: dgl.data.chem.RandomSplitter
:members: train_val_test_split, k_fold_split
.. autoclass:: dgl.data.chem.MolecularWeightSplitter
:members: train_val_test_split, k_fold_split
.. autoclass:: dgl.data.chem.ScaffoldSplitter
:members: train_val_test_split, k_fold_split
.. autoclass:: dgl.data.chem.SingleTaskStratifiedSplitter
:members: train_val_test_split, k_fold_split
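The splitters above all expose the same `train_val_test_split` entry point. As a rough illustration of the simplest case, a consecutive split just cuts the index range by the requested fractions (a sketch only, not the actual `ConsecutiveSplitter` implementation):

```python
# Minimal sketch of a consecutive train/val/test split over dataset indices.
def consecutive_split(num_datapoints, frac_train=0.8, frac_val=0.1, frac_test=0.1):
    assert abs(frac_train + frac_val + frac_test - 1.0) < 1e-6
    indices = list(range(num_datapoints))
    num_train = int(num_datapoints * frac_train)
    num_val = int(num_datapoints * frac_val)
    return (indices[:num_train],
            indices[num_train:num_train + num_val],
            indices[num_train + num_val:])

train_ids, val_ids, test_ids = consecutive_split(10)
```

The other splitters differ only in how they order or group the indices first (randomly, by molecular weight, by scaffold, or by label strata).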
......@@ -59,3 +59,13 @@ Currently supported model architectures:
.. autoclass:: dgl.model_zoo.chem.DGLJTNNVAE
:members: forward
Protein-Ligand Binding
``````````````````````
Currently supported model architectures:
* ACNN
.. autoclass:: dgl.model_zoo.chem.ACNN
:members: forward
......@@ -115,6 +115,13 @@ NNConv
:members: forward
:show-inheritance:
AtomicConv
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
.. autoclass:: dgl.nn.pytorch.conv.AtomicConv
:members: forward
:show-inheritance:
Dense Conv Layers
----------------------------------------
......@@ -222,5 +229,3 @@ Edge Softmax
.. automodule:: dgl.nn.pytorch.softmax
:members: edge_softmax
......@@ -4,20 +4,28 @@ With atoms being nodes and bonds being edges, molecular graphs are among the cor
Deep learning on graphs can be beneficial for various applications in Chemistry like drug and material discovery
[1], [2], [12].
To make it easy for domain scientists, the DGL team releases a model zoo for Chemistry, spanning three cases
-- property prediction, target generation/optimization and binding affinity prediction.
With pre-trained models and training scripts, we hope this model zoo will be helpful for both
the chemistry community and the deep learning community to further their research.
## Dependencies
Before you proceed, depending on the model/task you are interested in,
you may need to install some of the dependencies below:
- PyTorch 1.2
- Check the [official website](https://pytorch.org/) for installation guide.
- RDKit 2018.09.3
- We recommend installation with `conda install -c conda-forge rdkit==2018.09.3`. For other installation recipes,
see the [official documentation](https://www.rdkit.org/docs/Install.html).
- Pdbfixer
- We recommend installation with `conda install -c omnia pdbfixer`. To install from source, see the
[manual](http://htmlpreview.github.io/?https://raw.github.com/pandegroup/pdbfixer/master/Manual.html).
- MDTraj
- We recommend installation with `conda install -c conda-forge mdtraj`. For alternative ways of installation,
see the [official documentation](http://mdtraj.org/1.9.3/installation.html).
The remaining dependencies can be installed with `pip install -r requirements.txt`.
......@@ -31,13 +39,7 @@ Below we provide some reference numbers to show how DGL improves the speed of tr
| AttentiveFP on Aromaticity | 6.0 | 1.2 | 5x |
| JTNN on ZINC | 1826 | 743 | 2.5x |
## Featurization and Representation Learning
Fingerprints have been a widely used concept in cheminformatics. Chemists developed hand-designed rules to convert
molecules into binary strings where each bit indicates the presence or absence of a particular substructure. The
......@@ -47,6 +49,12 @@ mostly developed based on molecule fingerprints.
Graph neural networks make it possible to learn data-driven representations of molecules from the atoms, bonds and
molecular graph topology, which may be viewed as a learned fingerprint [3].
## Property Prediction
To evaluate molecules for drug candidates, we need to know their properties and activities. In practice, this is
mostly achieved via wet lab experiments. We can cast the problem as a regression or classification problem.
In practice, this can be quite difficult due to the scarcity of labeled data.
### Models
- **Graph Convolutional Networks** [3], [9]: Graph Convolutional Networks (GCN) have been one of the most popular graph
neural networks and they can be easily extended for graph level prediction.
......@@ -123,6 +131,17 @@ SVG(Draw.MolsToGridImage(mols, molsPerRow=4, subImgSize=(180, 150), useSVG=True)
![](https://s3.us-east-2.amazonaws.com/dgl.ai/model_zoo/drug_discovery/dgmg_model_zoo_example2.png)
## Binding Affinity Prediction

The interaction of drugs and proteins can be characterized in terms of binding affinity. Given a ligand
(drug candidate) and a protein with particular conformations, we are interested in predicting the
binding affinity between them.
### Models
- **Atomic Convolutional Networks** [14]: Constructs nearest neighbor graphs separately for the ligand, protein and complex
based on the 3D coordinates of the atoms and predicts the binding free energy.
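Predicting binding free energy is a regression problem, and one recurring detail when training such models is to z-score normalize the training labels and undo the normalization at evaluation time. A minimal sketch of that round trip, with all names illustrative:

```python
# Illustrative z-score normalization of regression labels.
def normalize(labels, mean, std):
    return [(y - mean) / std for y in labels]

def denormalize(preds, mean, std):
    return [p * std + mean for p in preds]

labels = [2.0, 4.0, 6.0]
mean = sum(labels) / len(labels)
# Unbiased (n - 1) standard deviation, matching torch.Tensor.std's default
std = (sum((y - mean) ** 2 for y in labels) / (len(labels) - 1)) ** 0.5
z = normalize(labels, mean, std)
recovered = denormalize(z, mean, std)
```

Normalizing the targets keeps the regression loss well-scaled during training; metrics are then computed on the denormalized predictions.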
## References
[1] Chen et al. (2018) The rise of deep learning in drug discovery. *Drug Discov Today* 6, 1241-1250.
......@@ -159,3 +178,5 @@ Machine Learning* JMLR. 1263-1272.
[13] Jin et al. (2018) Junction Tree Variational Autoencoder for Molecular Graph Generation.
*Proceedings of the 35th International Conference on Machine Learning (ICML)*, 2323-2332.
[14] Gomes et al. (2017) Atomic Convolutional Networks for Predicting Protein-Ligand Binding Affinity. *arXiv preprint arXiv:1703.10603*.
# Binding Affinity Prediction
## Datasets
- **PDBBind**: The PDBBind dataset in MoleculeNet [1] processed from the PDBBind database. The PDBBind
database consists of experimentally measured binding affinities for bio-molecular complexes [2], [3].
It provides detailed 3D Cartesian coordinates of both ligands and their target proteins derived from
experimental (e.g., X-ray crystallography) measurements. The availability of coordinates of the
protein-ligand complexes permits structure-based featurization that is aware of the protein-ligand
binding geometry. The authors of [1] use the "refined" and "core" subsets of the database [4], more carefully
processed for data artifacts, as additional benchmarking targets.
## Models
- **Atomic Convolutional Networks (ACNN)** [5]: Constructs nearest neighbor graphs separately for the ligand, protein and complex
based on the 3D coordinates of the atoms and predicts the binding free energy.
## Usage
Use `main.py` with arguments
```
-m {ACNN}, Model to use
-d {PDBBind_core_pocket_random, PDBBind_core_pocket_scaffold, PDBBind_core_pocket_stratified,
PDBBind_core_pocket_temporal, PDBBind_refined_pocket_random, PDBBind_refined_pocket_scaffold,
PDBBind_refined_pocket_stratified, PDBBind_refined_pocket_temporal}, dataset and splitting method to use
```
## Performance
### PDBBind
#### ACNN
| Subset | Splitting Method | Test MAE | Test R2 |
| ------- | ---------------- | -------- | ------- |
| Core | Random | 1.7688 | 0.1511 |
| Core | Scaffold | 2.5420 | 0.1471 |
| Core | Stratified | 1.7419 | 0.1520 |
| Core | Temporal | 1.9543 | 0.1640 |
| Refined | Random | 1.1948 | 0.4373 |
| Refined | Scaffold | 1.4021 | 0.2086 |
| Refined | Stratified | 1.6376 | 0.3050 |
| Refined | Temporal | 1.2457 | 0.3438 |
## Speed
### ACNN
Compared to [DeepChem's implementation](https://github.com/joegomes/deepchem/tree/acdc), we achieve a roughly
3.3x speedup in training time per epoch (from 1.40s to 0.42s). If we do not care about the
randomness introduced by some kernel optimizations, the speedup is roughly 4.4x (from 1.40s to 0.32s).
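The quoted speedups follow directly from the per-epoch times:

```python
deepchem_epoch_time = 1.40   # seconds per epoch, DeepChem implementation
dgl_epoch_time = 0.42        # seconds per epoch, DGL, deterministic kernels
dgl_fast_epoch_time = 0.32   # seconds per epoch, DGL, non-deterministic kernels

speedup = deepchem_epoch_time / dgl_epoch_time            # ~3.3x
speedup_fast = deepchem_epoch_time / dgl_fast_epoch_time  # ~4.4x
```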
## References
[1] Wu et al. (2017) MoleculeNet: a benchmark for molecular machine learning. *Chemical Science* 9, 513-530.
[2] Wang et al. (2004) The PDBbind database: collection of binding affinities for protein-ligand complexes
with known three-dimensional structures. *J Med Chem* 3;47(12):2977-80.
[3] Wang et al. (2005) The PDBbind database: methodologies and updates. *J Med Chem* 16;48(12):4111-9.
[4] Liu et al. (2015) PDB-wide collection of binding data: current status of the PDBbind database. *Bioinformatics* 1;31(3):405-12.
[5] Gomes et al. (2017) Atomic Convolutional Networks for Predicting Protein-Ligand Binding Affinity. *arXiv preprint arXiv:1703.10603*.
import numpy as np
import torch
ACNN_PDBBind_core_pocket_random = {
'dataset': 'PDBBind',
'subset': 'core',
'load_binding_pocket': True,
'random_seed': 123,
'frac_train': 0.8,
'frac_val': 0.,
'frac_test': 0.2,
'batch_size': 24,
'shuffle': False,
'hidden_sizes': [32, 32, 16],
'weight_init_stddevs': [1. / float(np.sqrt(32)), 1. / float(np.sqrt(32)),
1. / float(np.sqrt(16)), 0.01],
'dropouts': [0., 0., 0.],
'atomic_numbers_considered': torch.tensor([
1., 6., 7., 8., 9., 11., 12., 15., 16., 17., 20., 25., 30., 35., 53.]),
'radial': [[12.0], [0.0, 4.0, 8.0], [4.0]],
'lr': 0.001,
'num_epochs': 120,
'metrics': ['pearson_r2', 'mae'],
'split': 'random'
}
ACNN_PDBBind_core_pocket_scaffold = {
'dataset': 'PDBBind',
'subset': 'core',
'load_binding_pocket': True,
'random_seed': 123,
'frac_train': 0.8,
'frac_val': 0.,
'frac_test': 0.2,
'batch_size': 24,
'shuffle': False,
'hidden_sizes': [32, 32, 16],
'weight_init_stddevs': [1. / float(np.sqrt(32)), 1. / float(np.sqrt(32)),
1. / float(np.sqrt(16)), 0.01],
'dropouts': [0., 0., 0.],
'atomic_numbers_considered': torch.tensor([
1., 6., 7., 8., 9., 11., 12., 15., 16., 17., 20., 25., 30., 35., 53.]),
'radial': [[12.0], [0.0, 4.0, 8.0], [4.0]],
'lr': 0.001,
'num_epochs': 170,
'metrics': ['pearson_r2', 'mae'],
'split': 'scaffold'
}
ACNN_PDBBind_core_pocket_stratified = {
'dataset': 'PDBBind',
'subset': 'core',
'load_binding_pocket': True,
'random_seed': 123,
'frac_train': 0.8,
'frac_val': 0.,
'frac_test': 0.2,
'batch_size': 24,
'shuffle': False,
'hidden_sizes': [32, 32, 16],
'weight_init_stddevs': [1. / float(np.sqrt(32)), 1. / float(np.sqrt(32)),
1. / float(np.sqrt(16)), 0.01],
'dropouts': [0., 0., 0.],
'atomic_numbers_considered': torch.tensor([
1., 6., 7., 8., 9., 11., 12., 15., 16., 17., 20., 25., 30., 35., 53.]),
'radial': [[12.0], [0.0, 4.0, 8.0], [4.0]],
'lr': 0.001,
'num_epochs': 110,
'metrics': ['pearson_r2', 'mae'],
'split': 'stratified'
}
ACNN_PDBBind_core_pocket_temporal = {
'dataset': 'PDBBind',
'subset': 'core',
'load_binding_pocket': True,
'random_seed': 123,
'frac_train': 0.8,
'frac_val': 0.,
'frac_test': 0.2,
'batch_size': 24,
'shuffle': False,
'hidden_sizes': [32, 32, 16],
'weight_init_stddevs': [1. / float(np.sqrt(32)), 1. / float(np.sqrt(32)),
1. / float(np.sqrt(16)), 0.01],
'dropouts': [0., 0., 0.],
'atomic_numbers_considered': torch.tensor([
1., 6., 7., 8., 9., 11., 12., 15., 16., 17., 20., 25., 30., 35., 53.]),
'radial': [[12.0], [0.0, 4.0, 8.0], [4.0]],
'lr': 0.001,
'num_epochs': 80,
'metrics': ['pearson_r2', 'mae'],
'split': 'temporal'
}
ACNN_PDBBind_refined_pocket_random = {
'dataset': 'PDBBind',
'subset': 'refined',
'load_binding_pocket': True,
'random_seed': 123,
'frac_train': 0.8,
'frac_val': 0.,
'frac_test': 0.2,
'batch_size': 24,
'shuffle': False,
'hidden_sizes': [128, 128, 64],
'weight_init_stddevs': [0.125, 0.125, 0.177, 0.01],
'dropouts': [0.4, 0.4, 0.],
'atomic_numbers_considered': torch.tensor([
1., 6., 7., 8., 9., 11., 12., 15., 16., 17., 19., 20., 25., 26., 27., 28.,
29., 30., 34., 35., 38., 48., 53., 55., 80.]),
'radial': [[12.0], [0.0, 2.0, 4.0, 6.0, 8.0], [4.0]],
'lr': 0.001,
'num_epochs': 200,
'metrics': ['pearson_r2', 'mae'],
'split': 'random'
}
ACNN_PDBBind_refined_pocket_scaffold = {
'dataset': 'PDBBind',
'subset': 'refined',
'load_binding_pocket': True,
'random_seed': 123,
'frac_train': 0.8,
'frac_val': 0.,
'frac_test': 0.2,
'batch_size': 24,
'shuffle': False,
'hidden_sizes': [128, 128, 64],
'weight_init_stddevs': [0.125, 0.125, 0.177, 0.01],
'dropouts': [0.4, 0.4, 0.],
'atomic_numbers_considered': torch.tensor([
1., 6., 7., 8., 9., 11., 12., 15., 16., 17., 19., 20., 25., 26., 27., 28.,
29., 30., 34., 35., 38., 48., 53., 55., 80.]),
'radial': [[12.0], [0.0, 2.0, 4.0, 6.0, 8.0], [4.0]],
'lr': 0.001,
'num_epochs': 350,
'metrics': ['pearson_r2', 'mae'],
'split': 'scaffold'
}
ACNN_PDBBind_refined_pocket_stratified = {
'dataset': 'PDBBind',
'subset': 'refined',
'load_binding_pocket': True,
'random_seed': 123,
'frac_train': 0.8,
'frac_val': 0.,
'frac_test': 0.2,
'batch_size': 24,
'shuffle': False,
'hidden_sizes': [128, 128, 64],
'weight_init_stddevs': [0.125, 0.125, 0.177, 0.01],
'dropouts': [0.4, 0.4, 0.],
'atomic_numbers_considered': torch.tensor([
1., 6., 7., 8., 9., 11., 12., 15., 16., 17., 19., 20., 25., 26., 27., 28.,
29., 30., 34., 35., 38., 48., 53., 55., 80.]),
'radial': [[12.0], [0.0, 2.0, 4.0, 6.0, 8.0], [4.0]],
'lr': 0.001,
'num_epochs': 400,
'metrics': ['pearson_r2', 'mae'],
'split': 'stratified'
}
ACNN_PDBBind_refined_pocket_temporal = {
'dataset': 'PDBBind',
'subset': 'refined',
'load_binding_pocket': True,
'random_seed': 123,
'frac_train': 0.8,
'frac_val': 0.,
'frac_test': 0.2,
'batch_size': 24,
'shuffle': False,
'hidden_sizes': [128, 128, 64],
'weight_init_stddevs': [0.125, 0.125, 0.177, 0.01],
'dropouts': [0.4, 0.4, 0.],
'atomic_numbers_considered': torch.tensor([
1., 6., 7., 8., 9., 11., 12., 15., 16., 17., 19., 20., 25., 26., 27., 28.,
29., 30., 34., 35., 38., 48., 53., 55., 80.]),
'radial': [[12.0], [0.0, 2.0, 4.0, 6.0, 8.0], [4.0]],
'lr': 0.001,
'num_epochs': 350,
'metrics': ['pearson_r2', 'mae'],
'split': 'temporal'
}
experiment_configures = {
'ACNN_PDBBind_core_pocket_random': ACNN_PDBBind_core_pocket_random,
'ACNN_PDBBind_core_pocket_scaffold': ACNN_PDBBind_core_pocket_scaffold,
'ACNN_PDBBind_core_pocket_stratified': ACNN_PDBBind_core_pocket_stratified,
'ACNN_PDBBind_core_pocket_temporal': ACNN_PDBBind_core_pocket_temporal,
'ACNN_PDBBind_refined_pocket_random': ACNN_PDBBind_refined_pocket_random,
'ACNN_PDBBind_refined_pocket_scaffold': ACNN_PDBBind_refined_pocket_scaffold,
'ACNN_PDBBind_refined_pocket_stratified': ACNN_PDBBind_refined_pocket_stratified,
'ACNN_PDBBind_refined_pocket_temporal': ACNN_PDBBind_refined_pocket_temporal
}
def get_exp_configure(exp_name):
return experiment_configures[exp_name]
import torch
import torch.nn as nn
from torch.utils.data import DataLoader
from utils import set_random_seed, load_dataset, collate, load_model, Meter
def update_msg_from_scores(msg, scores):
for metric, score in scores.items():
msg += ', {} {:.4f}'.format(metric, score)
return msg
def run_a_train_epoch(args, epoch, model, data_loader,
loss_criterion, optimizer):
model.train()
train_meter = Meter(args['train_mean'], args['train_std'])
epoch_loss = 0
for batch_id, batch_data in enumerate(data_loader):
indices, ligand_mols, protein_mols, bg, labels = batch_data
labels, bg = labels.to(args['device']), bg.to(args['device'])
prediction = model(bg)
loss = loss_criterion(prediction, (labels - args['train_mean']) / args['train_std'])
epoch_loss += loss.data.item() * len(indices)
optimizer.zero_grad()
loss.backward()
optimizer.step()
train_meter.update(prediction, labels)
avg_loss = epoch_loss / len(data_loader.dataset)
total_scores = {metric: train_meter.compute_metric(metric) for metric in args['metrics']}
msg = 'epoch {:d}/{:d}, training | loss {:.4f}'.format(
epoch + 1, args['num_epochs'], avg_loss)
msg = update_msg_from_scores(msg, total_scores)
print(msg)
def run_an_eval_epoch(args, model, data_loader):
model.eval()
eval_meter = Meter(args['train_mean'], args['train_std'])
with torch.no_grad():
for batch_id, batch_data in enumerate(data_loader):
indices, ligand_mols, protein_mols, bg, labels = batch_data
labels, bg = labels.to(args['device']), bg.to(args['device'])
prediction = model(bg)
eval_meter.update(prediction, labels)
total_scores = {metric: eval_meter.compute_metric(metric) for metric in args['metrics']}
return total_scores
def main(args):
args['device'] = torch.device("cuda") if torch.cuda.is_available() else torch.device("cpu")
set_random_seed(args['random_seed'])
dataset, train_set, test_set = load_dataset(args)
args['train_mean'] = train_set.labels_mean.to(args['device'])
args['train_std'] = train_set.labels_std.to(args['device'])
    train_loader = DataLoader(dataset=train_set,
                              batch_size=args['batch_size'],
                              shuffle=args['shuffle'],
                              collate_fn=collate)
    test_loader = DataLoader(dataset=test_set,
                             batch_size=args['batch_size'],
                             # No need to shuffle the data for evaluation
                             shuffle=False,
                             collate_fn=collate)
model = load_model(args)
loss_fn = nn.MSELoss()
optimizer = torch.optim.Adam(model.parameters(), lr=args['lr'])
model.to(args['device'])
for epoch in range(args['num_epochs']):
run_a_train_epoch(args, epoch, model, train_loader, loss_fn, optimizer)
test_scores = run_an_eval_epoch(args, model, test_loader)
test_msg = update_msg_from_scores('test results', test_scores)
print(test_msg)
if __name__ == '__main__':
import argparse
from configure import get_exp_configure
parser = argparse.ArgumentParser(description='Protein-Ligand Binding Affinity Prediction')
parser.add_argument('-m', '--model', type=str, choices=['ACNN'],
help='Model to use')
parser.add_argument('-d', '--dataset', type=str,
choices=['PDBBind_core_pocket_random', 'PDBBind_core_pocket_scaffold',
'PDBBind_core_pocket_stratified', 'PDBBind_core_pocket_temporal',
'PDBBind_refined_pocket_random', 'PDBBind_refined_pocket_scaffold',
'PDBBind_refined_pocket_stratified', 'PDBBind_refined_pocket_temporal'],
help='Dataset to use')
args = parser.parse_args().__dict__
args['exp'] = '_'.join([args['model'], args['dataset']])
args.update(get_exp_configure(args['exp']))
main(args)
import dgl
import numpy as np
import random
import torch
import torch.nn.functional as F
from dgl import model_zoo
from dgl.data.chem import PDBBind, RandomSplitter, ScaffoldSplitter, SingleTaskStratifiedSplitter
from dgl.data.utils import Subset
from itertools import accumulate
from scipy.stats import pearsonr
def set_random_seed(seed=0):
"""Set random seed.
Parameters
----------
seed : int
Random seed to use. Default to 0.
"""
torch.backends.cudnn.deterministic = True
torch.backends.cudnn.benchmark = False
random.seed(seed)
np.random.seed(seed)
torch.manual_seed(seed)
if torch.cuda.is_available():
torch.cuda.manual_seed(seed)
def load_dataset(args):
"""Load the dataset.
Parameters
----------
args : dict
Input arguments.
Returns
-------
dataset
Full dataset.
train_set
Train subset of the dataset.
    test_set
        Test subset of the dataset.
"""
assert args['dataset'] in ['PDBBind'], 'Unexpected dataset {}'.format(args['dataset'])
if args['dataset'] == 'PDBBind':
dataset = PDBBind(subset=args['subset'],
load_binding_pocket=args['load_binding_pocket'],
zero_padding=True)
# No validation set is used and frac_val = 0.
if args['split'] == 'random':
train_set, _, test_set = RandomSplitter.train_val_test_split(
dataset,
frac_train=args['frac_train'],
frac_val=args['frac_val'],
frac_test=args['frac_test'],
random_state=args['random_seed'])
elif args['split'] == 'scaffold':
train_set, _, test_set = ScaffoldSplitter.train_val_test_split(
dataset,
mols=dataset.ligand_mols,
sanitize=False,
frac_train=args['frac_train'],
frac_val=args['frac_val'],
frac_test=args['frac_test'])
elif args['split'] == 'stratified':
train_set, _, test_set = SingleTaskStratifiedSplitter.train_val_test_split(
dataset,
labels=dataset.labels,
task_id=0,
frac_train=args['frac_train'],
frac_val=args['frac_val'],
frac_test=args['frac_test'],
random_state=args['random_seed'])
elif args['split'] == 'temporal':
years = dataset.df['release_year'].values.astype(np.float32)
indices = np.argsort(years).tolist()
frac_list = np.array([args['frac_train'], args['frac_val'], args['frac_test']])
num_data = len(dataset)
lengths = (num_data * frac_list).astype(int)
lengths[-1] = num_data - np.sum(lengths[:-1])
train_set, val_set, test_set = [
Subset(dataset, list(indices[offset - length:offset]))
for offset, length in zip(accumulate(lengths), lengths)]
    else:
        raise ValueError('Expect the splitting method to be "random", "scaffold", '
                         '"stratified" or "temporal", got {}'.format(args['split']))
train_labels = torch.stack([train_set.dataset.labels[i] for i in train_set.indices])
train_set.labels_mean = train_labels.mean(dim=0)
train_set.labels_std = train_labels.std(dim=0)
return dataset, train_set, test_set
def collate(data):
indices, ligand_mols, protein_mols, graphs, labels = map(list, zip(*data))
bg = dgl.batch_hetero(graphs)
for nty in bg.ntypes:
bg.set_n_initializer(dgl.init.zero_initializer, ntype=nty)
for ety in bg.canonical_etypes:
bg.set_e_initializer(dgl.init.zero_initializer, etype=ety)
labels = torch.stack(labels, dim=0)
return indices, ligand_mols, protein_mols, bg, labels
def load_model(args):
assert args['model'] in ['ACNN'], 'Unexpected model {}'.format(args['model'])
if args['model'] == 'ACNN':
model = model_zoo.chem.ACNN(hidden_sizes=args['hidden_sizes'],
weight_init_stddevs=args['weight_init_stddevs'],
dropouts=args['dropouts'],
features_to_use=args['atomic_numbers_considered'],
radial=args['radial'])
return model
class Meter(object):
"""Track and summarize model performance on a dataset for (multi-label) prediction.
Parameters
----------
    mean : torch.float32 tensor of shape (T)
        Mean of existing training labels across tasks, T for the number of tasks
    std : torch.float32 tensor of shape (T)
        Std of existing training labels across tasks, T for the number of tasks
"""
def __init__(self, mean=None, std=None):
self.y_pred = []
self.y_true = []
        if (mean is not None) and (std is not None):
self.mean = mean.cpu()
self.std = std.cpu()
else:
self.mean = None
self.std = None
def update(self, y_pred, y_true):
"""Update for the result of an iteration
Parameters
----------
y_pred : float32 tensor
Predicted molecule labels with shape (B, T),
B for batch size and T for the number of tasks
y_true : float32 tensor
Ground truth molecule labels with shape (B, T)
"""
self.y_pred.append(y_pred.detach().cpu())
self.y_true.append(y_true.detach().cpu())
def _finalize_labels_and_prediction(self):
"""Concatenate the labels and predictions.
If normalization was performed on the labels, undo the normalization.
"""
y_pred = torch.cat(self.y_pred, dim=0)
y_true = torch.cat(self.y_true, dim=0)
if (self.mean is not None) and (self.std is not None):
            # During training, the ground truth labels were normalized with the
            # training set mean and std to ease optimization.
            # We need to undo that normalization for evaluation.
y_pred = y_pred * self.std + self.mean
return y_pred, y_true
def pearson_r2(self):
"""Compute squared Pearson correlation coefficient
Returns
-------
float
"""
y_pred, y_true = self._finalize_labels_and_prediction()
return pearsonr(y_true[:, 0].numpy(), y_pred[:, 0].numpy())[0] ** 2
def mae(self):
"""Compute MAE
Returns
-------
float
"""
y_pred, y_true = self._finalize_labels_and_prediction()
return F.l1_loss(y_true, y_pred).data.item()
def rmse(self):
"""
Compute RMSE
Returns
-------
float
"""
y_pred, y_true = self._finalize_labels_and_prediction()
return np.sqrt(F.mse_loss(y_pred, y_true).cpu().item())
def compute_metric(self, metric_name):
"""Compute metric
Parameters
----------
metric_name : str
Name for the metric to compute.
Returns
-------
float
Metric value
"""
assert metric_name in ['pearson_r2', 'mae', 'rmse'], \
'Expect metric name to be "pearson_r2", "mae" or "rmse", got {}'.format(metric_name)
if metric_name == 'pearson_r2':
return self.pearson_r2()
if metric_name == 'mae':
return self.mae()
if metric_name == 'rmse':
return self.rmse()
......@@ -44,8 +44,8 @@ def run_an_eval_epoch(args, model, data_loader):
return np.mean(eval_meter.compute_metric(args['metric_name']))
def main(args):
args['device'] = torch.device("cuda") if torch.cuda.is_available() else torch.device("cpu")
set_random_seed(args['random_seed'])
# Interchangeable with other datasets
dataset, train_set, val_set, test_set = load_dataset_for_classification(args)
......
......@@ -6,11 +6,14 @@ from functools import partial
from utils import chirality
GCN_Tox21 = {
'random_seed': 0,
'batch_size': 128,
'lr': 1e-3,
'num_epochs': 100,
'atom_data_field': 'h',
'frac_train': 0.8,
'frac_val': 0.1,
'frac_test': 0.1,
'in_feats': 74,
'gcn_hidden_feats': [64, 64],
'classifier_hidden_feats': 64,
......@@ -20,11 +23,14 @@ GCN_Tox21 = {
}
GAT_Tox21 = {
'random_seed': 0,
'batch_size': 128,
'lr': 1e-3,
'num_epochs': 100,
'atom_data_field': 'h',
'frac_train': 0.8,
'frac_val': 0.1,
'frac_test': 0.1,
'in_feats': 74,
'gat_hidden_feats': [32, 32],
'classifier_hidden_feats': 64,
......@@ -35,6 +41,7 @@ GAT_Tox21 = {
}
MPNN_Alchemy = {
'random_seed': 0,
'batch_size': 16,
'num_epochs': 250,
'node_in_feats': 15,
......@@ -47,6 +54,7 @@ MPNN_Alchemy = {
}
SCHNET_Alchemy = {
'random_seed': 0,
'batch_size': 16,
'num_epochs': 250,
'norm': True,
......@@ -58,6 +66,7 @@ SCHNET_Alchemy = {
}
MGCN_Alchemy = {
'random_seed': 0,
'batch_size': 16,
'num_epochs': 250,
'norm': True,
......@@ -81,7 +90,9 @@ AttentiveFP_Aromaticity = {
'lr': 10 ** (-2.5),
'batch_size': 128,
'num_epochs': 800,
'frac_train': 0.8,
'frac_val': 0.1,
'frac_test': 0.1,
'patience': 80,
'metric_name': 'rmse',
# Follow the atom featurization in the original work
......
......@@ -55,8 +55,8 @@ def run_an_eval_epoch(args, model, data_loader):
return total_score
def main(args):
args['device'] = torch.device("cuda") if torch.cuda.is_available() else torch.device("cpu")
set_random_seed(args['random_seed'])
train_set, val_set, test_set = load_dataset_for_regression(args)
train_loader = DataLoader(dataset=train_set,
......
......@@ -6,8 +6,7 @@ import torch
import torch.nn.functional as F
from dgl import model_zoo
from dgl.data.chem import smiles_to_bigraph, one_hot_encoding, RandomSplitter
from sklearn.metrics import roc_auc_score
def set_random_seed(seed=0):
......@@ -278,8 +277,10 @@ def load_dataset_for_classification(args):
assert args['dataset'] in ['Tox21']
if args['dataset'] == 'Tox21':
from dgl.data.chem import Tox21
dataset = Tox21(smiles_to_bigraph, args['atom_featurizer'])
train_set, val_set, test_set = RandomSplitter.train_val_test_split(
dataset, frac_train=args['frac_train'], frac_val=args['frac_val'],
frac_test=args['frac_test'], random_state=args['random_seed'])
return dataset, train_set, val_set, test_set
......@@ -310,10 +311,12 @@ def load_dataset_for_regression(args):
if args['dataset'] == 'Aromaticity':
from dgl.data.chem import PubChemBioAssayAromaticity
dataset = PubChemBioAssayAromaticity(smiles_to_bigraph,
args['atom_featurizer'],
args['bond_featurizer'])
train_set, val_set, test_set = RandomSplitter.train_val_test_split(
dataset, frac_train=args['frac_train'], frac_val=args['frac_val'],
frac_test=args['frac_test'], random_state=args['random_seed'])
return train_set, val_set, test_set
......
from .datasets import *
from .utils import *
from .csv_dataset import MoleculeCSVDataset
from .tox21 import Tox21
from .alchemy import TencentAlchemyDataset
from .pubchem_aromaticity import PubChemBioAssayAromaticity
from .pdbbind import PDBBind
......@@ -6,14 +6,13 @@ import numpy as np
import os
import os.path as osp
import pathlib
import pickle
import zipfile
from collections import defaultdict
from ..utils import mol_to_complete_graph, atom_type_one_hot, \
atom_hybridization_one_hot, atom_is_aromatic
from ...utils import download, get_download_dir, _get_dgl_url, save_graphs, load_graphs
from .... import backend as F
try:
import pandas as pd
......@@ -156,11 +155,12 @@ class TencentAlchemyDataset(object):
mol_to_graph: callable, str -> DGLGraph
A function turning an RDKit molecule instance into a DGLGraph.
Default to :func:`dgl.data.chem.mol_to_complete_graph`.
node_featurizer : callable, rdkit.Chem.rdchem.Mol -> dict
Featurization for nodes like atoms in a molecule, which can be used to update
ndata for a DGLGraph. By default, we construct graphs where nodes represent atoms
and node features represent atom features. We store the atomic numbers under the
name ``"node_type"`` and store the atom features under the name ``"n_feat"``.
The atom features include:
* One-hot encoding for atom types
* Atomic number of atoms
* Whether the atom is a donor
......@@ -168,16 +168,17 @@ class TencentAlchemyDataset(object):
* Whether the atom is aromatic
* One-hot encoding for atom hybridization
* Total number of Hs on the atom
bond_featurizer : callable, rdkit.Chem.rdchem.Mol -> dict
Featurization for bonds in a molecule, which can be used to update
edata for a DGLGraph. By default, we store the distance between the
end atoms under the name ``"distance"`` and store the bond features under
the name ``"e_feat"``. The bond features are one-hot encodings of the bond type.
edge_featurizer : callable, rdkit.Chem.rdchem.Mol -> dict
Featurization for edges like bonds in a molecule, which can be used to update
edata for a DGLGraph. By default, we construct edges between every pair of atoms,
excluding self loops. We store the distance between the end atoms under the name
``"distance"`` and store the edge features under the name ``"e_feat"``. The edge
features are one-hot encodings of edge types (bond types and non-bond edges).
"""
def __init__(self, mode='dev', from_raw=False,
mol_to_graph=mol_to_complete_graph,
atom_featurizer=alchemy_nodes,
bond_featurizer=alchemy_edges):
node_featurizer=alchemy_nodes,
edge_featurizer=alchemy_edges):
if mode == 'test':
raise ValueError('The test mode is not supported before '
'the Alchemy contest finishes.')
......@@ -205,9 +206,9 @@ class TencentAlchemyDataset(object):
archive.extractall(file_dir)
archive.close()
self._load(mol_to_graph, atom_featurizer, bond_featurizer)
self._load(mol_to_graph, node_featurizer, edge_featurizer)
def _load(self, mol_to_graph, atom_featurizer, bond_featurizer):
def _load(self, mol_to_graph, node_featurizer, edge_featurizer):
if not self.from_raw:
self.graphs, label_dict = load_graphs(osp.join(self.file_dir, "%s_graphs.bin" % self.mode))
self.labels = label_dict['labels']
......@@ -230,8 +231,8 @@ class TencentAlchemyDataset(object):
for mol, label in zip(supp, self.target.iterrows()):
cnt += 1
print('Processing molecule {:d}/{:d}'.format(cnt, dataset_size))
graph = mol_to_graph(mol, atom_featurizer=atom_featurizer,
bond_featurizer=bond_featurizer)
graph = mol_to_graph(mol, node_featurizer=node_featurizer,
edge_featurizer=edge_featurizer)
smiles = Chem.MolToSmiles(mol)
self.smiles.append(smiles)
self.graphs.append(graph)
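Featurizers such as ``atom_type_one_hot`` and ``atom_hybridization_one_hot`` imported above map each atom attribute to a one-hot vector over an allowable set. A minimal sketch of that encoding scheme (the helper below is illustrative, not DGL's exact implementation):

```python
def one_hot_encoding(value, allowable_set, encode_unknown=False):
    """Map a value to a one-hot list over an allowable set.

    If encode_unknown is True, an extra final slot marks values
    outside the allowable set.
    """
    if encode_unknown:
        allowable_set = list(allowable_set) + [None]
        if value not in allowable_set:
            value = None
    return [int(value == v) for v in allowable_set]

# Carbon over atom types C, N, O -> [1, 0, 0]
print(one_hot_encoding('C', ['C', 'N', 'O']))
```

Concatenating several such vectors per atom yields the ``"n_feat"`` node feature described in the docstring above.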
......
from __future__ import absolute_import
import dgl.backend as F
import numpy as np
import os
import sys
from ..utils import save_graphs, load_graphs
from ... import backend as F
from ...utils import save_graphs, load_graphs
from .... import backend as F
class MoleculeCSVDataset(object):
"""MoleculeCSVDataset
......@@ -27,21 +26,21 @@ class MoleculeCSVDataset(object):
Column names other than smiles column would be considered as task names.
smiles_to_graph: callable, str -> DGLGraph
A function turning a SMILES into a DGLGraph.
atom_featurizer : callable, rdkit.Chem.rdchem.Mol -> dict
Featurization for atoms in a molecule, which can be used to update
node_featurizer : callable, rdkit.Chem.rdchem.Mol -> dict
Featurization for nodes like atoms in a molecule, which can be used to update
ndata for a DGLGraph.
bond_featurizer : callable, rdkit.Chem.rdchem.Mol -> dict
Featurization for bonds in a molecule, which can be used to update
edge_featurizer : callable, rdkit.Chem.rdchem.Mol -> dict
Featurization for edges like bonds in a molecule, which can be used to update
edata for a DGLGraph.
smiles_column: str
Column name containing the SMILES strings.
cache_file_path: str
Path to store the preprocessed data.
"""
def __init__(self, df, smiles_to_graph, atom_featurizer, bond_featurizer,
def __init__(self, df, smiles_to_graph, node_featurizer, edge_featurizer,
smiles_column, cache_file_path):
if 'rdkit' not in sys.modules:
from ...base import dgl_warning
from ....base import dgl_warning
dgl_warning(
"Please install RDKit (Recommended Version is 2018.09.3)")
self.df = df
......@@ -49,9 +48,9 @@ class MoleculeCSVDataset(object):
self.task_names = self.df.columns.drop([smiles_column]).tolist()
self.n_tasks = len(self.task_names)
self.cache_file_path = cache_file_path
self._pre_process(smiles_to_graph, atom_featurizer, bond_featurizer)
self._pre_process(smiles_to_graph, node_featurizer, edge_featurizer)
def _pre_process(self, smiles_to_graph, atom_featurizer, bond_featurizer):
def _pre_process(self, smiles_to_graph, node_featurizer, edge_featurizer):
"""Pre-process the dataset
* Convert molecules from smiles format into DGLGraphs
......@@ -63,11 +62,11 @@ class MoleculeCSVDataset(object):
----------
smiles_to_graph : callable, SMILES -> DGLGraph
Function for converting a SMILES (str) into a DGLGraph.
atom_featurizer : callable, rdkit.Chem.rdchem.Mol -> dict
Featurization for atoms in a molecule, which can be used to update
node_featurizer : callable, rdkit.Chem.rdchem.Mol -> dict
Featurization for nodes like atoms in a molecule, which can be used to update
ndata for a DGLGraph.
bond_featurizer : callable, rdkit.Chem.rdchem.Mol -> dict
Featurization for bonds in a molecule, which can be used to update
edge_featurizer : callable, rdkit.Chem.rdchem.Mol -> dict
Featurization for edges like bonds in a molecule, which can be used to update
edata for a DGLGraph.
"""
if os.path.exists(self.cache_file_path):
......@@ -81,8 +80,8 @@ class MoleculeCSVDataset(object):
self.graphs = []
for i, s in enumerate(self.smiles):
print('Processing molecule {:d}/{:d}'.format(i+1, len(self)))
self.graphs.append(smiles_to_graph(s, atom_featurizer=atom_featurizer,
bond_featurizer=bond_featurizer))
self.graphs.append(smiles_to_graph(s, node_featurizer=node_featurizer,
edge_featurizer=edge_featurizer))
_label_values = self.df[self.task_names].values
# np.nan_to_num will also turn inf into a very large number
self.labels = F.zerocopy_from_numpy(np.nan_to_num(_label_values).astype(np.float32))
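The label preparation above replaces missing values before converting the array to a framework tensor. A small NumPy-only sketch of that step, with made-up label values:

```python
import numpy as np

# Multi-task labels read from a CSV, with missing entries as NaN
label_values = np.array([[0.5, np.nan],
                         [np.nan, 1.0]])

# np.nan_to_num maps nan -> 0.0 (and would also map inf
# to a very large finite number)
labels = np.nan_to_num(label_values).astype(np.float32)
print(labels)
```

Tasks with missing labels are typically masked out later, so the zero fill only serves to keep the tensor dense.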
......
"""PDBBind dataset processed by MoleculeNet."""
import numpy as np
import os
import pandas as pd
from ..utils import multiprocess_load_molecules, ACNN_graph_construction_and_featurization
from ...utils import get_download_dir, download, _get_dgl_url, extract_archive
from .... import backend as F
class PDBBind(object):
"""PDBbind dataset processed by MoleculeNet.
The description below is mainly based on
`[1] <https://pubs.rsc.org/en/content/articlelanding/2018/sc/c7sc02664a#cit50>`__.
The PDBbind database consists of experimentally measured binding affinities for
bio-molecular complexes `[2] <https://www.ncbi.nlm.nih.gov/pubmed/?term=15163179%5Buid%5D>`__,
`[3] <https://www.ncbi.nlm.nih.gov/pubmed/?term=15943484%5Buid%5D>`__. It provides detailed
3D Cartesian coordinates of both ligands and their target proteins derived from experimental
(e.g., X-ray crystallography) measurements. The availability of coordinates of the
protein-ligand complexes permits structure-based featurization that is aware of the
protein-ligand binding geometry. The authors of
`[1] <https://pubs.rsc.org/en/content/articlelanding/2018/sc/c7sc02664a#cit50>`__ use the
"refined" and "core" subsets of the database
`[4] <https://www.ncbi.nlm.nih.gov/pubmed/?term=25301850%5Buid%5D>`__, more carefully
processed for data artifacts, as additional benchmarking targets.
References:
* [1] MoleculeNet: a benchmark for molecular machine learning
* [2] The PDBbind database: collection of binding affinities for protein-ligand complexes
with known three-dimensional structures
* [3] The PDBbind database: methodologies and updates
* [4] PDB-wide collection of binding data: current status of the PDBbind database
Parameters
----------
subset : str
In MoleculeNet, we can use either the "refined" subset or the "core" subset. We can
retrieve them by setting ``subset`` to be ``'refined'`` or ``'core'``. The size
of the ``'core'`` set is 195 and the size of the ``'refined'`` set is 3706.
load_binding_pocket : bool
Whether to load binding pockets or full proteins. Default to True.
add_hydrogens : bool
Whether to add hydrogens via pdbfixer. Default to False.
sanitize : bool
Whether sanitization is performed in initializing RDKit molecule instances. See
https://www.rdkit.org/docs/RDKit_Book.html for details of the sanitization.
Default to False.
calc_charges : bool
Whether to add Gasteiger charges via RDKit. Setting this to True will force
``add_hydrogens`` and ``sanitize`` to be True. Default to False.
remove_hs : bool
Whether to remove hydrogens via RDKit. Note that removing hydrogens can be quite
slow for large molecules. Default to False.
use_conformation : bool
Whether we need to extract molecular conformation from proteins and ligands.
Default to True.
construct_graph_and_featurize : callable
Construct a DGLHeteroGraph for the use of GNNs. Mapping self.ligand_mols[i],
self.protein_mols[i], self.ligand_coordinates[i] and self.protein_coordinates[i]
to a DGLHeteroGraph. Default to :func:`ACNN_graph_construction_and_featurization`.
zero_padding : bool
Whether to perform zero padding. While DGL does not necessarily require zero padding,
pooling operations for variable length inputs can introduce stochastic behaviour, which
is not desired for sensitive scenarios. Default to True.
num_processes : int or None
Number of worker processes to use. If None,
then we will use the number of CPUs in the system. Default to 64.
"""
def __init__(self, subset, load_binding_pocket=True, add_hydrogens=False,
sanitize=False, calc_charges=False, remove_hs=False, use_conformation=True,
construct_graph_and_featurize=ACNN_graph_construction_and_featurization,
zero_padding=True, num_processes=64):
self.task_names = ['-logKd/Ki']
self.n_tasks = len(self.task_names)
self._url = 'dataset/pdbbind_v2015.tar.gz'
root_dir_path = get_download_dir()
data_path = root_dir_path + '/pdbbind_v2015.tar.gz'
extracted_data_path = root_dir_path + '/pdbbind_v2015'
download(_get_dgl_url(self._url), path=data_path)
extract_archive(data_path, extracted_data_path)
if subset == 'core':
index_label_file = extracted_data_path + '/v2015/INDEX_core_data.2013'
elif subset == 'refined':
index_label_file = extracted_data_path + '/v2015/INDEX_refined_data.2015'
else:
raise ValueError(
'Expect subset to be either '
'core or refined, got {}'.format(subset))
self._preprocess(extracted_data_path, index_label_file, load_binding_pocket,
add_hydrogens, sanitize, calc_charges, remove_hs, use_conformation,
construct_graph_and_featurize, zero_padding, num_processes)
def _filter_out_invalid(self, ligands_loaded, proteins_loaded, use_conformation):
"""Filter out invalid ligand-protein pairs.
Parameters
----------
ligands_loaded : list
Each element is a 2-tuple of the RDKit molecule instance and its associated atom
coordinates. None is used to represent invalid/non-existing molecule or coordinates.
proteins_loaded : list
Each element is a 2-tuple of the RDKit molecule instance and its associated atom
coordinates. None is used to represent invalid/non-existing molecule or coordinates.
use_conformation : bool
Whether we need conformation information (atom coordinates) and filter out molecules
without valid conformation.
"""
num_pairs = len(proteins_loaded)
self.indices, self.ligand_mols, self.protein_mols = [], [], []
if use_conformation:
self.ligand_coordinates, self.protein_coordinates = [], []
else:
# Use None for placeholders.
self.ligand_coordinates = [None for _ in range(num_pairs)]
self.protein_coordinates = [None for _ in range(num_pairs)]
for i in range(num_pairs):
ligand_mol, ligand_coordinates = ligands_loaded[i]
protein_mol, protein_coordinates = proteins_loaded[i]
if (not use_conformation) and all(v is not None for v in [protein_mol, ligand_mol]):
self.indices.append(i)
self.ligand_mols.append(ligand_mol)
self.protein_mols.append(protein_mol)
elif all(v is not None for v in [
protein_mol, protein_coordinates, ligand_mol, ligand_coordinates]):
self.indices.append(i)
self.ligand_mols.append(ligand_mol)
self.ligand_coordinates.append(ligand_coordinates)
self.protein_mols.append(protein_mol)
self.protein_coordinates.append(protein_coordinates)
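``_filter_out_invalid`` above keeps only indices where both molecules (and, when conformations are required, both coordinate arrays) loaded successfully. A standalone sketch of the same rule, using placeholder strings instead of RDKit molecules:

```python
def filter_valid_pairs(ligands_loaded, proteins_loaded, use_conformation):
    """Return indices of pairs where every required item is not None."""
    kept = []
    for i, ((lig_mol, lig_xyz), (prot_mol, prot_xyz)) in enumerate(
            zip(ligands_loaded, proteins_loaded)):
        required = [lig_mol, prot_mol]
        if use_conformation:
            # Coordinates are only required when conformations are used
            required += [lig_xyz, prot_xyz]
        if all(v is not None for v in required):
            kept.append(i)
    return kept

# Second ligand loaded, but its conformation failed to parse
ligands = [('mol_a', 'xyz_a'), ('mol_b', None)]
proteins = [('mol_p', 'xyz_p'), ('mol_q', 'xyz_q')]
```

With ``use_conformation=True`` only the first pair survives; with ``False`` both do.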
def _preprocess(self, root_path, index_label_file, load_binding_pocket,
add_hydrogens, sanitize, calc_charges, remove_hs, use_conformation,
construct_graph_and_featurize, zero_padding, num_processes):
"""Preprocess the dataset.
The pre-processing proceeds as follows:
1. Load the dataset
2. Clean the dataset and filter out invalid pairs
3. Construct graphs
4. Prepare node and edge features
Parameters
----------
root_path : str
Root path for molecule files.
index_label_file : str
Path to the index file for the dataset.
load_binding_pocket : bool
Whether to load binding pockets or full proteins.
add_hydrogens : bool
Whether to add hydrogens via pdbfixer.
sanitize : bool
Whether sanitization is performed in initializing RDKit molecule instances. See
https://www.rdkit.org/docs/RDKit_Book.html for details of the sanitization.
calc_charges : bool
Whether to add Gasteiger charges via RDKit. Setting this to True will force
``add_hydrogens`` and ``sanitize`` to be True.
remove_hs : bool
Whether to remove hydrogens via RDKit. Note that removing hydrogens can be quite
slow for large molecules.
use_conformation : bool
Whether we need to extract molecular conformation from proteins and ligands.
construct_graph_and_featurize : callable
Construct a DGLHeteroGraph for the use of GNNs. Mapping self.ligand_mols[i],
self.protein_mols[i], self.ligand_coordinates[i] and self.protein_coordinates[i]
to a DGLHeteroGraph. Default to :func:`ACNN_graph_construction_and_featurization`.
zero_padding : bool
Whether to perform zero padding. While DGL does not necessarily require zero padding,
pooling operations for variable length inputs can introduce stochastic behaviour, which
is not desired for sensitive scenarios.
num_processes : int or None
Number of worker processes to use. If None,
then we will use the number of CPUs in the system.
"""
contents = []
with open(index_label_file, 'r') as f:
for line in f.readlines():
if line[0] != "#":
splitted_elements = line.split()
if len(splitted_elements) == 8:
# Ignore "//"
contents.append(splitted_elements[:5] + splitted_elements[6:])
else:
print('Incorrect data format.')
print(splitted_elements)
self.df = pd.DataFrame(contents, columns=(
'PDB_code', 'resolution', 'release_year',
'-logKd/Ki', 'Kd/Ki', 'reference', 'ligand_name'))
pdbs = self.df['PDB_code'].tolist()
self.ligand_files = [os.path.join(
root_path, 'v2015', pdb, '{}_ligand.sdf'.format(pdb)) for pdb in pdbs]
if load_binding_pocket:
self.protein_files = [os.path.join(
root_path, 'v2015', pdb, '{}_pocket.pdb'.format(pdb)) for pdb in pdbs]
else:
self.protein_files = [os.path.join(
root_path, 'v2015', pdb, '{}_protein.pdb'.format(pdb)) for pdb in pdbs]
num_processes = min(num_processes, len(pdbs))
print('Loading ligands...')
ligands_loaded = multiprocess_load_molecules(self.ligand_files,
add_hydrogens=add_hydrogens,
sanitize=sanitize,
calc_charges=calc_charges,
remove_hs=remove_hs,
use_conformation=use_conformation,
num_processes=num_processes)
print('Loading proteins...')
proteins_loaded = multiprocess_load_molecules(self.protein_files,
add_hydrogens=add_hydrogens,
sanitize=sanitize,
calc_charges=calc_charges,
remove_hs=remove_hs,
use_conformation=use_conformation,
num_processes=num_processes)
self._filter_out_invalid(ligands_loaded, proteins_loaded, use_conformation)
self.df = self.df.iloc[self.indices]
self.labels = F.zerocopy_from_numpy(self.df[self.task_names].values.astype(np.float32))
print('Finished cleaning the dataset, '
'got {:d}/{:d} valid pairs'.format(len(self), len(pdbs)))
# Prepare zero padding
if zero_padding:
max_num_ligand_atoms = 0
max_num_protein_atoms = 0
for i in range(len(self)):
max_num_ligand_atoms = max(
max_num_ligand_atoms, self.ligand_mols[i].GetNumAtoms())
max_num_protein_atoms = max(
max_num_protein_atoms, self.protein_mols[i].GetNumAtoms())
else:
max_num_ligand_atoms = None
max_num_protein_atoms = None
print('Start constructing graphs and featurizing them.')
self.graphs = []
for i in range(len(self)):
print('Constructing and featurizing datapoint {:d}/{:d}'.format(i+1, len(self)))
self.graphs.append(construct_graph_and_featurize(
self.ligand_mols[i], self.protein_mols[i],
self.ligand_coordinates[i], self.protein_coordinates[i],
max_num_ligand_atoms, max_num_protein_atoms))
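The index-file parsing near the top of ``_preprocess`` keeps non-comment lines with exactly eight whitespace-separated fields and drops the ``//`` separator before the reference. A self-contained sketch (the sample line is illustrative, not taken from the real index file):

```python
COLUMNS = ('PDB_code', 'resolution', 'release_year',
           '-logKd/Ki', 'Kd/Ki', 'reference', 'ligand_name')

def parse_index_line(line):
    """Parse one PDBBind index line into a dict, or return None."""
    if line.startswith('#'):
        return None  # comment line
    fields = line.split()
    if len(fields) != 8:
        return None  # unexpected format
    # Drop the '//' separator between the data and the reference
    fields = fields[:5] + fields[6:]
    return dict(zip(COLUMNS, fields))

sample = '3abc  2.20  2012  4.00  Ki=100uM  //  3abc.pdf  (LIG)'
print(parse_index_line(sample))
```

Collecting the resulting dicts into a ``pandas.DataFrame`` reproduces the ``self.df`` construction above.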
def __len__(self):
"""Get the size of the dataset.
Returns
-------
int
Number of valid ligand-protein pairs in the dataset.
"""
return len(self.indices)
def __getitem__(self, item):
"""Get the datapoint associated with the index.
Parameters
----------
item : int
Index for the datapoint.
Returns
-------
int
Index for the datapoint.
rdkit.Chem.rdchem.Mol
RDKit molecule instance for the ligand molecule.
rdkit.Chem.rdchem.Mol
RDKit molecule instance for the protein molecule.
DGLHeteroGraph
Pre-processed DGLHeteroGraph with features extracted.
Float32 tensor
Label for the datapoint.
"""
return item, self.ligand_mols[item], self.protein_mols[item], \
self.graphs[item], self.labels[item]
......@@ -2,8 +2,8 @@ import pandas as pd
import sys
from .csv_dataset import MoleculeCSVDataset
from .utils import smiles_to_bigraph
from ..utils import get_download_dir, download, _get_dgl_url
from ..utils import smiles_to_bigraph
from ...utils import get_download_dir, download, _get_dgl_url
class PubChemBioAssayAromaticity(MoleculeCSVDataset):
"""Subset of PubChem BioAssay Dataset for aromaticity prediction.
......@@ -15,12 +15,24 @@ class PubChemBioAssayAromaticity(MoleculeCSVDataset):
The dataset was constructed by sampling 3945 molecules with 0-40 aromatic atoms from the
PubChem BioAssay dataset.
Parameters
----------
smiles_to_graph: callable, str -> DGLGraph
A function turning a SMILES string into a DGLGraph.
Default to :func:`dgl.data.chem.smiles_to_bigraph`.
node_featurizer : callable, rdkit.Chem.rdchem.Mol -> dict
Featurization for nodes like atoms in a molecule, which can be used to update
ndata for a DGLGraph. Default to None.
edge_featurizer : callable, rdkit.Chem.rdchem.Mol -> dict
Featurization for edges like bonds in a molecule, which can be used to update
edata for a DGLGraph. Default to None.
"""
def __init__(self, smiles_to_graph=smiles_to_bigraph,
atom_featurizer=None,
bond_featurizer=None):
node_featurizer=None,
edge_featurizer=None):
if 'pandas' not in sys.modules:
from ...base import dgl_warning
from ....base import dgl_warning
dgl_warning("Please install pandas")
self._url = 'dataset/pubchem_bioassay_aromaticity.csv'
......@@ -28,5 +40,5 @@ class PubChemBioAssayAromaticity(MoleculeCSVDataset):
download(_get_dgl_url(self._url), path=data_path)
df = pd.read_csv(data_path)
super(PubChemBioAssayAromaticity, self).__init__(df, smiles_to_graph, atom_featurizer, bond_featurizer,
super(PubChemBioAssayAromaticity, self).__init__(df, smiles_to_graph, node_featurizer, edge_featurizer,
"cano_smiles", "pubchem_aromaticity_dglgraph.bin")
import sys
from .csv_dataset import MoleculeCSVDataset
from .utils import smiles_to_bigraph
from ..utils import get_download_dir, download, _get_dgl_url
from ... import backend as F
from ..utils import smiles_to_bigraph
from ...utils import get_download_dir, download, _get_dgl_url
from .... import backend as F
try:
import pandas as pd
......@@ -32,18 +32,18 @@ class Tox21(MoleculeCSVDataset):
smiles_to_graph: callable, str -> DGLGraph
A function turning a SMILES string into a DGLGraph.
Default to :func:`dgl.data.chem.smiles_to_bigraph`.
atom_featurizer : callable, rdkit.Chem.rdchem.Mol -> dict
Featurization for atoms in a molecule, which can be used to update
node_featurizer : callable, rdkit.Chem.rdchem.Mol -> dict
Featurization for nodes like atoms in a molecule, which can be used to update
ndata for a DGLGraph. Default to None.
bond_featurizer : callable, rdkit.Chem.rdchem.Mol -> dict
Featurization for bonds in a molecule, which can be used to update
edge_featurizer : callable, rdkit.Chem.rdchem.Mol -> dict
Featurization for edges like bonds in a molecule, which can be used to update
edata for a DGLGraph. Default to None.
"""
def __init__(self, smiles_to_graph=smiles_to_bigraph,
atom_featurizer=None,
bond_featurizer=None):
node_featurizer=None,
edge_featurizer=None):
if 'pandas' not in sys.modules:
from ...base import dgl_warning
from ....base import dgl_warning
dgl_warning("Please install pandas")
self._url = 'dataset/tox21.csv.gz'
......@@ -54,7 +54,7 @@ class Tox21(MoleculeCSVDataset):
df = df.drop(columns=['mol_id'])
super(Tox21, self).__init__(df, smiles_to_graph, atom_featurizer, bond_featurizer,
super(Tox21, self).__init__(df, smiles_to_graph, node_featurizer, edge_featurizer,
"smiles", "tox21_dglgraph.bin")
self._weight_balancing()
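``_weight_balancing`` itself is not shown in this diff; a common scheme for imbalanced multi-task binary datasets like Tox21 weights positive samples of each task by the negative/positive ratio. A hypothetical sketch (not necessarily DGL's exact formula):

```python
import numpy as np

def positive_task_weights(labels):
    """Per-task weight for positive samples: n_negative / n_positive."""
    labels = np.asarray(labels, dtype=np.float32)
    n_pos = labels.sum(axis=0)
    n_neg = labels.shape[0] - n_pos
    # Guard against tasks with no positives
    return n_neg / np.maximum(n_pos, 1.0)

# task 0: 2 positives / 2 negatives, task 1: 1 positive / 3 negatives
labels = np.array([[1, 0], [0, 0], [1, 1], [0, 0]])
print(positive_task_weights(labels))
```

Such weights can be passed to a loss like ``BCEWithLogitsLoss(pos_weight=...)`` so rare positives contribute proportionally more.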
......
from .splitters import *
from .featurizers import *
from .mol_to_graph import *
from .complex_to_graph import *
from .rdkit_utils import *