# Property Prediction

## Classification

Classification tasks require assigning discrete labels to a molecule, e.g. molecule toxicity.

### Datasets
- **Tox21**. The ["Toxicology in the 21st Century" (Tox21)](https://tripod.nih.gov/tox21/challenge/) initiative created
a public database measuring the toxicity of compounds, which was used in the 2014 Tox21 Data Challenge. The dataset
contains qualitative toxicity measurements for 8014 compounds on 12 different targets, including nuclear receptors and
stress response pathways. Each target yields a binary prediction problem. MoleculeNet [1] randomly splits the dataset
into training, validation and test sets with an 80/10/10 ratio. By default we follow their split method.
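
A random 80/10/10 split like the one described above can be sketched in a few lines. This is a minimal illustration using only the standard library, not the splitter this example actually uses; the function `random_split`, its parameters and the fixed seed are hypothetical.

```python
import random


def random_split(num_samples, frac_train=0.8, frac_val=0.1, seed=0):
    """Shuffle sample indices and cut them into train/validation/test subsets."""
    indices = list(range(num_samples))
    # A dedicated Random instance keeps the split reproducible for a given seed
    random.Random(seed).shuffle(indices)
    n_train = int(frac_train * num_samples)
    n_val = int(frac_val * num_samples)
    train = indices[:n_train]
    val = indices[n_train:n_train + n_val]
    # Whatever remains after the first two cuts becomes the test set
    test = indices[n_train + n_val:]
    return train, val, test


train_idx, val_idx, test_idx = random_split(8014)
print(len(train_idx), len(val_idx), len(test_idx))
```

Because the split depends on the seed, two runs with different seeds evaluate on different test molecules, which is why scores from different sources are not directly comparable.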

### Models
- **Weave** [9]. Weave is one of the pioneering efforts in applying graph neural networks to molecular property prediction.
- **Graph Convolutional Network** [2], [3]. Graph Convolutional Networks (GCNs) are among the most popular graph neural
networks and can be easily extended to graph-level prediction. MoleculeNet [1] reports baseline results of graph
convolutions over multiple datasets.
- **Graph Attention Networks** [7]. Graph Attention Networks (GATs) incorporate multi-head attention into GCNs,
explicitly modeling the interactions between adjacent atoms.

### Usage

Run `classification.py` with the following arguments:
```
-m {GCN, GAT, Weave}, MODEL, model to use
-d {Tox21}, DATASET, dataset to use
```

To use a pre-trained model, simply add `-p`.
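
For readers who want to see how such a command line could be wired up, here is a minimal `argparse` sketch of the documented flags. It is an illustration only, not the code in `classification.py`: the long option names (`--model`, `--dataset`, `--pre-trained`) and the choice to make `-m` and `-d` required are assumptions.

```python
import argparse


def build_parser():
    """Build a parser mirroring the documented -m, -d and -p flags (hypothetical layout)."""
    parser = argparse.ArgumentParser(description="Molecule classification (illustrative)")
    parser.add_argument("-m", "--model", choices=["GCN", "GAT", "Weave"],
                        required=True, help="model to use")
    parser.add_argument("-d", "--dataset", choices=["Tox21"],
                        required=True, help="dataset to use")
    # store_true makes -p a switch: absent means False, present means True
    parser.add_argument("-p", "--pre-trained", action="store_true",
                        help="use a pre-trained model")
    return parser


# Parse an explicit argument list instead of sys.argv so the sketch runs anywhere
args = build_parser().parse_args(["-m", "GCN", "-d", "Tox21", "-p"])
print(args.model, args.dataset, args.pre_trained)
```

With the flags documented above, evaluating the pre-trained GCN on Tox21 would correspond to `python classification.py -m GCN -d Tox21 -p`.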

The script uses a GPU whenever one is available.

### Performance

#### GCN on Tox21

| Source           | Averaged Test ROC-AUC Score |
| ---------------- | --------------------------- |
| MoleculeNet [1]  | 0.829                       |
| [DeepChem example](https://github.com/deepchem/deepchem/blob/master/examples/tox21/tox21_tensorgraph_graph_conv.py) | 0.813                  |
| Pretrained model | 0.833                       |

Note that the dataset is split randomly, so these numbers are for reference only and small gaps between them do not
necessarily indicate a real difference.
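
The metric reported here is ROC-AUC averaged over tasks. The sketch below shows one way to compute it in plain Python, assuming the average is an unweighted mean of the per-task scores and that missing labels (marked `None` here) are skipped. Both assumptions and all function names are illustrative, and the pairwise loop favors clarity over speed.

```python
def roc_auc(labels, scores):
    """ROC-AUC as the fraction of (positive, negative) pairs ranked correctly."""
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    if not positives or not negatives:
        raise ValueError("ROC-AUC needs at least one positive and one negative")
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5  # a tie counts as half a correctly ranked pair
    return wins / (len(positives) * len(negatives))


def averaged_roc_auc(labels, scores):
    """Mean per-task ROC-AUC; labels[i][t] is 0, 1, or None if task t is unlabeled."""
    num_tasks = len(labels[0])
    per_task = []
    for t in range(num_tasks):
        # Keep only the molecules that actually have a label for task t
        pairs = [(row[t], s[t]) for row, s in zip(labels, scores) if row[t] is not None]
        per_task.append(roc_auc([y for y, _ in pairs], [s for _, s in pairs]))
    return sum(per_task) / num_tasks


# Four molecules, two tasks; the second molecule has no label for the second task
toy_labels = [[1, 0], [0, None], [1, 1], [0, 0]]
toy_scores = [[0.9, 0.2], [0.3, 0.5], [0.4, 0.8], [0.6, 0.1]]
print(averaged_roc_auc(toy_labels, toy_scores))
```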

#### GAT on Tox21

| Source           | Averaged Test ROC-AUC Score |
| ---------------- | --------------------------- |
| Pretrained model | 0.853                       |

#### Weave on Tox21

| Source           | Averaged Test ROC-AUC Score |
| ---------------- | --------------------------- |
| Pretrained model | 0.8074                      |

## Regression   

Regression tasks require assigning continuous labels to a molecule, e.g. molecular energy.

### Datasets  

- **Alchemy**. The [Alchemy Dataset](https://alchemy.tencent.com/) is introduced by Tencent Quantum Lab to facilitate the development of new 
machine learning models useful for chemistry and materials science. The dataset lists 12 quantum mechanical properties of 130,000+ organic 
molecules comprising up to 12 heavy atoms (C, N, O, S, F and Cl), sampled from the [GDBMedChem](http://gdb.unibe.ch/downloads/) database. 
These properties have been calculated using the open-source computational chemistry program Python-based Simulation of Chemistry Framework 
([PySCF](https://github.com/pyscf/pyscf)). The Alchemy dataset expands on the volume and diversity of existing molecular datasets such as QM9. 
- **PubChem BioAssay Aromaticity**. This dataset was introduced in 
[Pushing the Boundaries of Molecular Representation for Drug Discovery with the Graph Attention Mechanism](https://www.ncbi.nlm.nih.gov/pubmed/31408336) 
for the task of predicting the number of aromatic atoms in a molecule. It was constructed by sampling 3945 molecules with 0-40 aromatic atoms 
from the PubChem BioAssay dataset.

### Models  

- **Message Passing Neural Network** [6]. Message Passing Neural Networks (MPNNs) held the best reported performance on
the QM9 dataset for some time.
- **SchNet** [4]. SchNet employs continuous filter convolutional layers to model quantum interactions in molecules 
without requiring them to lie on grids.
- **Multilevel Graph Convolutional Neural Network** [5]. Multilevel Graph Convolutional Neural Networks (MGCN) are 
hierarchical graph neural networks that extract features from the conformation and spatial information followed by the
multilevel interactions.
- **AttentiveFP** [8]. AttentiveFP combines attention with gated recurrent units (GRUs) for better model capacity and shows competitive 
performance across datasets.

### Usage

Run `regression.py` with the following arguments:
```
-m {MPNN, SchNet, MGCN, AttentiveFP}, Model to use
-d {Alchemy, Aromaticity}, Dataset to use
```

To use a pre-trained model, simply add `-p`. Currently we only provide a pre-trained model for AttentiveFP
on the PubChem BioAssay Aromaticity dataset.

### Performance    

#### Alchemy

The Alchemy contest is still ongoing. Until the test set is fully released, we only report performance
on the training and validation sets for reference.

| Model      | Training MAE | Validation MAE |  
| ---------- | ------------ | -------------- |
| SchNet [4] | 0.0651       | 0.0925         |
| MGCN [5]   | 0.0582       | 0.0942         |
| MPNN [6]   | 0.1004       | 0.1587         |

#### PubChem BioAssay Aromaticity

| Model           | Test RMSE |
| --------------- | --------- |
| AttentiveFP [8] | 0.7508    |

Note that the dataset is randomly split so this number is only for reference.
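
For reference, the two error metrics used in this section, mean absolute error (MAE) and root mean squared error (RMSE), can be sketched as follows. This is a plain-Python illustration on made-up numbers, not the evaluation code behind the tables, and it does not cover any label normalization or multi-task averaging the scripts may apply.

```python
import math


def mean_absolute_error(targets, predictions):
    """Average magnitude of the errors; every error counts linearly."""
    return sum(abs(t - p) for t, p in zip(targets, predictions)) / len(targets)


def root_mean_squared_error(targets, predictions):
    """Square root of the mean squared error; large errors are penalized more."""
    squared = sum((t - p) ** 2 for t, p in zip(targets, predictions))
    return math.sqrt(squared / len(targets))


# Hypothetical aromatic-atom counts and model outputs for four molecules
targets = [6.0, 0.0, 12.0, 5.0]
predictions = [5.0, 1.0, 12.0, 7.0]
print(mean_absolute_error(targets, predictions))
print(root_mean_squared_error(targets, predictions))
```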

## Interpretation

Xiong et al. [8] visualize the atom weights computed in the readout as a possible aid to interpretation, as in the figure below. 
We provide a Jupyter notebook for performing this visualization, which you can download with 
`wget https://data.dgl.ai/dgllife/attentive_fp/atom_weight_visualization.ipynb`.

![](https://data.dgl.ai/dgllife/attentive_fp_vis_example.png)

## Dataset Customization

We generally follow PyTorch conventions for datasets.

A dataset class should implement the `__getitem__(self, index)` and `__len__(self)` methods:

```python
class CustomDataset(object):
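    def __init__(self, smiles, graphs, labels):
        # Parallel sequences with one SMILES string, one DGLGraph
        # and one label tensor per molecule
        self.smiles = smiles
        self.graphs = graphs
        self.labels = labels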

    def __getitem__(self, index):
        """
        Parameters
        ----------
        index : int
            Index for the datapoint.
        
        Returns
        -------
        str
            SMILES for the molecule
        DGLGraph
            Constructed DGLGraph for the molecule
        1D Tensor of dtype float32
            Labels of the datapoint
        """
        return self.smiles[index], self.graphs[index], self.labels[index]
    
    def __len__(self):
        return len(self.smiles)
```

We provide various methods for graph construction in `dgllife.utils.mol_to_graph`. If your dataset can 
be converted to a pandas dataframe, e.g. a .csv file, you may use `MoleculeCSVDataset` in 
`dgllife.data.csv_dataset`.
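
To show how a dataset with this interface is typically consumed, the sketch below indexes a toy dataset and groups the samples into a batch. It is a minimal illustration in plain Python: `ToyDataset` and `collate` are hypothetical names, and the graphs are placeholder strings rather than real `DGLGraph` objects.

```python
class ToyDataset(object):
    """Stand-in that follows the interface above, with strings in place of DGLGraphs."""

    def __init__(self, smiles, graphs, labels):
        self.smiles = smiles
        self.graphs = graphs
        self.labels = labels

    def __getitem__(self, index):
        return self.smiles[index], self.graphs[index], self.labels[index]

    def __len__(self):
        return len(self.smiles)


def collate(samples):
    """Turn a list of (smiles, graph, labels) tuples into three parallel lists."""
    smiles, graphs, labels = map(list, zip(*samples))
    return smiles, graphs, labels


dataset = ToyDataset(
    smiles=["CCO", "c1ccccc1", "CC(=O)O"],
    graphs=["graph_0", "graph_1", "graph_2"],  # placeholders for DGLGraph objects
    labels=[[0.0], [1.0], [0.0]],
)
# __len__ and __getitem__ are all that index-based iteration needs
batch_smiles, batch_graphs, batch_labels = collate([dataset[i] for i in range(len(dataset))])
print(batch_smiles)
```

With real data, a function of this shape would usually be passed as `collate_fn` to `torch.utils.data.DataLoader`, merging the graphs with `dgl.batch` and stacking the label tensors with `torch.stack`.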

## References
[1] Wu et al. (2017) MoleculeNet: a benchmark for molecular machine learning. *Chemical Science* 9, 513-530.

[2] Duvenaud et al. (2015) Convolutional networks on graphs for learning molecular fingerprints. *Advances in neural 
information processing systems (NeurIPS)*, 2224-2232.

[3] Kipf and Welling (2017) Semi-Supervised Classification with Graph Convolutional Networks.
*The International Conference on Learning Representations (ICLR)*.

[4] Schütt et al. (2017) SchNet: A continuous-filter convolutional neural network for modeling quantum interactions. 
*Advances in Neural Information Processing Systems (NeurIPS)*, 992-1002.

[5] Lu et al. (2019) Molecular Property Prediction: A Multilevel Quantum Interactions Modeling Perspective. 
*The 33rd AAAI Conference on Artificial Intelligence*. 

[6] Gilmer et al. (2017) Neural Message Passing for Quantum Chemistry. *Proceedings of the 34th International Conference on 
Machine Learning*, JMLR. 1263-1272.

[7] Veličković et al. (2018) Graph Attention Networks. 
*The International Conference on Learning Representations (ICLR)*. 

[8] Xiong et al. (2019) Pushing the Boundaries of Molecular Representation for Drug Discovery with the Graph 
Attention Mechanism. *Journal of Medicinal Chemistry*.

[9] Kearnes et al. (2016) Molecular graph convolutions: moving beyond fingerprints. 
*Journal of Computer-Aided Molecular Design*.