# DGL-LifeSci

## Introduction

Deep learning on graphs has been a rising trend in the past few years. There are many graphs in 
life science, such as molecular graphs and biological networks, making it an important area for 
applying deep learning on graphs. DGL-LifeSci is a DGL-based package for various applications in 
life science with graph neural networks. 

We provide various functionalities, including but not limited to methods for graph construction, 
featurization, and evaluation; model architectures; training scripts; and pre-trained models.

**For a full list of work implemented in DGL-LifeSci, see [here](examples/README.md).**

## Example Usage

To apply graph neural networks to molecules with DGL, we first need to construct a `DGLGraph` -- 
the graph data structure in DGL -- and prepare initial node/edge features. Below is an example of 
constructing a bi-directed graph from a molecule and featurizing it with atom and bond features such 
as atom type and bond type.

```python
from dgllife.utils import smiles_to_bigraph, CanonicalAtomFeaturizer, CanonicalBondFeaturizer

# Node featurizer
node_featurizer = CanonicalAtomFeaturizer(atom_data_field='h')
# Edge featurizer
edge_featurizer = CanonicalBondFeaturizer(bond_data_field='h')
# SMILES (a string representation for molecule) for Penicillin
smiles = 'CC1(C(N2C(S1)C(C2=O)NC(=O)CC3=CC=CC=C3)C(=O)O)C'
g = smiles_to_bigraph(smiles=smiles, 
                      node_featurizer=node_featurizer,
                      edge_featurizer=edge_featurizer)
print(g)
"""
DGLGraph(num_nodes=23, num_edges=50,
         ndata_schemes={'h': Scheme(shape=(74,), dtype=torch.float32)}
         edata_schemes={'h': Scheme(shape=(12,), dtype=torch.float32)})
"""
```
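
The computed features live in the graph's `ndata` and `edata` fields under the `'h'` key chosen above. As a minimal sketch (assuming the code above has just been run), they can be inspected directly:

```python
# A minimal sketch of inspecting the features computed above.
# 'h' is the data field name passed to the featurizers.
atom_feats = g.ndata['h']   # shape: (23, 74), one row per atom
bond_feats = g.edata['h']   # shape: (50, 12), one row per directed edge
print(atom_feats.shape, bond_feats.shape)
```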

We implement various models that users can import directly. Below is an example of defining a 
GCN-based model for molecular property prediction.

```python
from dgllife.model import GCNPredictor

model = GCNPredictor(in_feats=1)
```
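
The snippet above only constructs a model. As a minimal sketch (not part of the original example), the featurized graph from earlier can be passed through the predictor, assuming `in_feats` is set to match the 74-dimensional atom features produced by `CanonicalAtomFeaturizer`:

```python
import torch

# A minimal sketch, reusing the graph g featurized earlier.
# in_feats must match the node feature size (74 for CanonicalAtomFeaturizer);
# n_tasks is the number of prediction targets.
model = GCNPredictor(in_feats=74, n_tasks=1)
model.eval()
with torch.no_grad():
    preds = model(g, g.ndata['h'])
print(preds.shape)  # torch.Size([1, 1])
```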

For a full example of applying `GCNPredictor`, run the following command

```bash
python examples/property_prediction/classification.py -m GCN -d Tox21
```

For more examples on molecular property prediction, generative models, protein-ligand binding affinity 
prediction and reaction prediction, see `examples`.

We also provide pre-trained models for most examples, which can be used off the shelf without training from scratch. 
Below is an example of loading a pre-trained model for `GCNPredictor` on a molecular property prediction dataset.

```python
from dgllife.data import Tox21
from dgllife.model import load_pretrained
from dgllife.utils import smiles_to_bigraph, CanonicalAtomFeaturizer

dataset = Tox21(smiles_to_bigraph, CanonicalAtomFeaturizer())
model = load_pretrained('GCN_Tox21') # Pretrained model loaded
model.eval()

smiles, g, label, mask = dataset[0]
feats = g.ndata.pop('h')
label_pred = model(g, feats)
print(smiles)                   # CCOc1ccc2nc(S(N)(=O)=O)sc2c1
print(label_pred[:, mask != 0]) # Mask non-existing labels
# tensor([[ 1.4190, -0.1820,  1.2974,  1.4416,  0.6914,  
# 2.0957,  0.5919,  0.7715, 1.7273,  0.2070]])
```
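
The printed values are the raw model outputs. Assuming the usual multi-task binary classification setup for the 12 Tox21 tasks, a sigmoid maps them to per-task probabilities (a minimal sketch continuing the example above):

```python
import torch

# A minimal sketch: convert the raw predictions above to per-task probabilities,
# assuming a multi-task binary classification setup for Tox21.
probs = torch.sigmoid(label_pred)
print(probs[:, mask != 0])
```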

Similarly, we can load a pre-trained model for generating molecules.

```python
from dgllife.model import load_pretrained

model = load_pretrained('DGMG_ZINC_canonical')
model.eval()
smiles = []
for i in range(4):
    smiles.append(model(rdkit_mol=True))

print(smiles)
# ['CC1CCC2C(CCC3C2C(NC2=CC(Cl)=CC=C2N)S3(=O)=O)O1',
# 'O=C1SC2N=CN=C(NC(SC3=CC=CC=N3)C1=CC=CO)C=2C1=CCCC1', 
# 'CC1C=CC(=CC=1)C(=O)NN=C(C)C1=CC=CC2=CC=CC=C21', 
# 'CCN(CC1=CC=CC=C1F)CC1CCCN(C)C1']
```

If you are running the code block above in a Jupyter notebook, you can also visualize the generated molecules with

```python
from IPython.display import SVG
from rdkit import Chem
from rdkit.Chem import Draw

mols = [Chem.MolFromSmiles(s) for s in smiles]
SVG(Draw.MolsToGridImage(mols, molsPerRow=4, subImgSize=(180, 150), useSVG=True))
```

![](https://data.dgl.ai/dgllife/dgmg/dgmg_model_zoo_example2.png)
    
## Installation

DGL-LifeSci requires Python 3.5+, DGL 0.4.3+ and PyTorch 1.2.0+.

Additionally, we require `RDKit 2018.09.3` for cheminformatics. We recommend installing it with 
`conda install -c conda-forge rdkit==2018.09.3`. For other installation recipes, 
see the [official documentation](https://www.rdkit.org/docs/Install.html).

We provide installation of `DGL-LifeSci` with pip. Once you have installed the package, 
verify that the installation succeeded with 

```python
import dgllife

print(dgllife.__version__)
# 0.2.0
```

### Using pip

```bash
pip install dgllife
```

## Speed Reference

Below we provide some reference numbers to show how DGL improves the speed of training models, measured in seconds per epoch.

| Model                              | Original Implementation | DGL Implementation | Improvement |
| ---------------------------------- | ----------------------- | ------------------ | ----------- |
| GCN on Tox21                       | 5.5 (DeepChem)          | 1.0                | 5.5x        |
| AttentiveFP on Aromaticity         | 6.0                     | 1.2                | 5x          |
| JTNN on ZINC                       | 1826                    | 743                | 2.5x        |
| WLN for reaction center prediction | 11657                   | 5095               | 2.3x        |