# Learning Deep Generative Models of Graphs (DGMG)

Yujia Li, Oriol Vinyals, Chris Dyer, Razvan Pascanu, and Peter Battaglia. 
Learning Deep Generative Models of Graphs. *arXiv preprint arXiv:1803.03324*, 2018.

DGMG generates graphs by progressively adding nodes and edges, as shown below:
![](https://user-images.githubusercontent.com/19576924/48605003-7f11e900-e9b6-11e8-8880-87362348e154.png)

For molecules, the nodes are atoms and the edges are bonds.

**Goal**: Given a set of real molecules, we want to learn their distribution and generate new molecules
with similar properties. See the `Evaluation` section for more details.

## Dataset

### Preprocessing

Our implementation of this model has several limitations:
1. Information about protonation and chirality is ignored during generation.
2. Molecules containing charged atoms such as `[N+]` and `[O-]` cannot be generated.

For example, even with the correct decisions, the model can only generate
`O=C1NC(=S)NC(=O)C1=CNC1=CC=C(N(=O)O)C=C1O` in place of
`O=C1NC(=S)NC(=O)C1=CNC1=CC=C([N+](=O)[O-])C=C1O`.

To avoid issues with validity and novelty, we filter such molecules out of the dataset.

### ChEMBL

The authors use the [ChEMBL database](https://www.ebi.ac.uk/chembl/). Since they 
did not release the code, we use a subset from [Olivecrona et al.](https://github.com/MarcusOlivecrona/REINVENT), 
another work on generative modeling. 

The authors restrict their dataset to molecules with at most 20 heavy atoms and use a training/validation
split of 130,830/26,166 examples. We use the same split sizes but have to relax the limit from 20 to 23 heavy atoms
as we are using a different subset.

### ZINC

After preprocessing, we are left with 232,464 molecules for training and 5,000 molecules for validation.

## Usage

### Training

Training autoregressive generative models tends to be very slow. According to the authors, they use multiprocessing to
speed up training, and GPUs do not give much of a speed advantage. We follow their approach and perform multi-process
CPU training.

To start training, use `train.py` with required arguments
```
-d DATASET, dataset to use (default: None), built-in support exists for ChEMBL, ZINC
-o {random,canonical}, order to generate graphs (default: None)
```

and optional arguments
```
-s SEED,  random seed (default: 0)
-np NUM_PROCESSES, number of processes to use (default: 32)
```
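
For example, a typical training run on the built-in ZINC subset might look like the following (the flag values are placeholders you can adjust):
```
# Train on the built-in ZINC subset with canonical ordering,
# using 32 worker processes and a fixed random seed
python train.py -d ZINC -o canonical -s 0 -np 32
```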

Even though multiprocessing yields a significant speedup compared to a single process, training can still take a long
time (several days). One epoch of training and validation can take up to an hour and a half on our machine. Unless
necessary, we recommend using our pre-trained models.

We also checkpoint the model whenever its performance on the validation set improves, so you do not need to wait until
training terminates.

All training results can be found in `training_results`.

#### Dataset configuration

You can also use your own dataset with additional arguments
```
-tf TRAIN_FILE, path to a file with one SMILES per line for training data.
                This is only necessary if you want to use a new dataset. (default: None)
-vf VAL_FILE, path to a file with one SMILES per line for validation data.
              This is only necessary if you want to use a new dataset. (default: None)
```
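
For example, assuming `-d` still supplies the dataset configuration to use, training on your own SMILES files (the file names below are hypothetical) might look like:
```
# Train with custom SMILES files, one molecule per line (paths are placeholders)
python train.py -d ChEMBL -o canonical -tf my_train_smiles.txt -vf my_val_smiles.txt
```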

#### Monitoring

We can monitor the training process with TensorBoard as shown below:

![](https://data.dgl.ai/dgllife/dgmg/tensorboard.png)

To use TensorBoard, you need to install [tensorboardX](https://github.com/lanpa/tensorboardX) and 
[TensorFlow](https://www.tensorflow.org/). You can launch TensorBoard with `tensorboard --logdir=.`.

If you are training on a remote server, you can still use TensorBoard as follows (see the example after this list):
1. Launch it on the remote server with `tensorboard --logdir=. --port=A`.
2. In a terminal on your local machine, run `ssh -NfL localhost:B:localhost:A username@your_remote_host_name`.
3. Go to `localhost:B` in your browser.
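
For example, with the hypothetical port choices `A=6006` on the remote server and `B=16006` on your local machine, the two commands would be:
```
# On the remote server (6006 is an arbitrary example port)
tensorboard --logdir=. --port=6006

# On your local machine (16006 is an arbitrary local port; replace the user and host with your own)
ssh -NfL localhost:16006:localhost:6006 username@your_remote_host_name
```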

### Evaluation

To start evaluation, use `eval.py` with required arguments
```
-d DATASET, dataset to use (default: None), built-in support exists for ChEMBL, ZINC
-o {random,canonical}, order to generate graphs, used for naming evaluation directory (default: None)
-p MODEL_PATH, path to a saved model (default: None). This is not needed if you use a pre-trained model.
-pr, whether to use a pre-trained model (default: False)
```

and optional arguments
```
-s SEED, random seed (default: 0)
-ns NUM_SAMPLES, Number of molecules to generate (default: 100000)
-mn MAX_NUM_STEPS, Max number of steps allowed in generated molecules to
                   ensure termination (default: 400)
-np NUM_PROCESSES, number of processes to use (default: 32)
-gt GENERATION_TIME, max time (seconds) allowed for generation with
                     multiprocess (default: 600)
```
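
For example, to evaluate a model you trained yourself, an invocation along the following lines should work (the checkpoint path is a placeholder; point it at the checkpoint saved under `training_results`):
```
# Evaluate a locally trained checkpoint on ZINC with canonical ordering,
# generating 100000 molecules with 32 processes (the model path is a placeholder)
python eval.py -d ZINC -o canonical -p training_results/model.pth -ns 100000 -np 32
```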

All evaluation results can be found in `eval_results`.

After the evaluation, 100,000 molecules (by default) will be generated and stored in `generated_smiles.txt` under the
`eval_results` directory, with three statistics logged in `generation_stats.txt` under `eval_results`:
1. `Validity among all` gives the percentage of generated molecules that are valid.
2. `Uniqueness among valid ones` gives the percentage of valid molecules that are unique.
3. `Novelty among unique ones` gives the percentage of unique valid molecules that are novel (not seen in the training data).

We also provide a Jupyter notebook where you can visualize the generated molecules 

![](https://data.dgl.ai/dgllife/dgmg/DGMG_ZINC_canonical_vis.png)

and compare their property distributions against those of the training molecules.

![](https://data.dgl.ai/dgllife/dgmg/DGMG_ZINC_canonical_dist.png)

You can download the notebook with `wget https://data.dgl.ai/dgllife/dgmg/eval_jupyter.ipynb`.

### Pre-trained models

The table below gives the statistics of the pre-trained models. With random order, training becomes significantly more
difficult since with `N` molecules we now have `N^2` data points.

| Pre-trained model  | % valid | % unique among valid | % novel among unique |
| ------------------ | ------- | -------------------- | -------------------- |
| `ChEMBL_canonical` | 78.80   | 99.19                | 98.60                |            
| `ChEMBL_random`    | 29.09   | 99.87                | 100.00               |
| `ZINC_canonical`   | 74.60   | 99.87                | 99.87                |
| `ZINC_random`      | 12.37   | 99.38                | 100.00               |
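
To generate molecules with one of these pre-trained models, an invocation along the following lines should work, assuming the `-d`/`-o` combination selects the corresponding checkpoint when `-pr` is set:
```
# Sample from the pre-trained ChEMBL model trained with canonical ordering
python eval.py -d ChEMBL -o canonical -pr
```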