README.md

# Junction Tree Variational Autoencoder for Molecular Graph Generation (JTNN)

Wengong Jin, Regina Barzilay, Tommi Jaakkola. 
Junction Tree Variational Autoencoder for Molecular Graph Generation. 
*arXiv preprint arXiv:1802.04364*, 2018.

JTNN uses algorithm called junction tree algorithm to form a tree from the molecular graph. 
Then the model will encode the tree and graph into two separate vectors `z_G` and `z_T`. Details can
be found in original paper. The brief process is as below (from original paper): 

![image](https://user-images.githubusercontent.com/8686776/63677300-3fb6d980-c81f-11e9-8a65-57c8b03aaf52.png)

**Goal**: JTNN is an auto-encoder model, aiming to learn hidden representation for molecular graphs. 
These representations can be used for downstream tasks, such as property prediction, or molecule optimizations.

## Dataset

### ZINC

> The ZINC database is a curated collection of commercially available chemical compounds 
prepared especially for virtual screening. (introduction from Wikipedia)

Generally speaking, molecules in the ZINC dataset are more drug-like. We uses ~220,000 
molecules for training and 5000 molecules for validation. 

### Preprocessing

Class `JTNNDataset` will process a SMILES string into a dict, consisting of a junction tree, a graph with 
encoded nodes(atoms) and edges(bonds), and other information for model to use.

## Usage

### Training

To start training, use `python train.py`. By default, the script will use ZINC dataset
 with preprocessed vocabulary, and save model checkpoint periodically in the current working directory. 

### Evaluation

To start evaluation, use `python reconstruct_eval.py`. By default, we will perform evaluation with 
DGL's pre-trained model. During the evaluation, the program will print out the success rate of 
molecule reconstruction.

### Pre-trained models

Below gives the statistics of our pre-trained `JTNN_ZINC` model. 

| Pre-trained model  | % Reconstruction Accuracy
| ------------------ | -------
| `JTNN_ZINC`        |  73.7             

### Visualization

Here we draw some "neighbor" of a given molecule, by adding noises on the intermediate representations. 
You can download the script with `wget https://data.dgl.ai/dgllife/jtnn_viz_neighbor_mol.ipynb`. 
Please put this script at the current directory (`examples/pytorch/model_zoo/chem/generative_models/jtnn/`).

#### Given Molecule
![image](https://user-images.githubusercontent.com/8686776/63773593-0d37da00-c90e-11e9-8933-0abca4b430db.png)
#### Neighbor Molecules
![image](https://user-images.githubusercontent.com/8686776/63773602-1163f780-c90e-11e9-8341-5122dc0d0c82.png)

### Dataset configuration

If you want to use your own dataset, please create a file with one SMILES a line as below

```
CCO
Fc1ccccc1
```

You can generate the vocabulary file corresponding to your dataset with `python vocab.py -d X -v Y`, where `X` 
is the path to the dataset and `Y` is the path to the vocabulary file to save. An example vocabulary file 
corresponding to the two molecules above will be

```
CC
CF
C1=CC=CC=C1
CO
```

If you want to develop a model based on DGL's pre-trained model, it's important to make sure that the vocabulary 
generated above is a subset of the vocabulary we use for the pre-trained model. By running `vocab.py` above, we 
also check if the new vocabulary is a subset of the vocabulary we use for the pre-trained model and print the 
result in the terminal as follows:

```
The new vocabulary is a subset of the default vocabulary: True
```

To train on this new dataset, run

```
python train.py -t X
```

where `X` is the path to the new dataset. If you want to use the vocabulary generated above, also add `-v Y`, where 
`Y` is the path to the vocabulary file we just saved.

To evaluate on this new dataset, run `python reconstruct_eval.py` with arguments same as above.