# Training OpenFold
## Background

This guide covers how to train an OpenFold model for monomers. Some additional instructions are provided at the end for fine-tuning your model.

### Pre-requisites: 

This guide requires the following:
- [Installation of OpenFold and dependencies](Installation.md) (including the jackhmmer and hhblits dependencies)
- A preprocessed dataset:
	- For this guide, we will use the original OpenFold dataset, which is available on RODA, processed with [these instructions](OpenFold_Training_Setup.md).
- GPUs configured with CUDA. Training OpenFold with CPUs only is not supported. 

## Training a new OpenFold model 

#### Basic command

For a dataset that has the default alignment file structure, e.g. 

```
- $DATA_DIR
  ├── pdb_data
  │   ├── mmcifs
  │   │   ├── 3lrm.cif
  │   │   ├── 6kwc.cif
  │   │   └── ...
  │   ├── obsolete.dat
  │   ├── duplicate_pdb_chains.txt
  │   └── data_caches
  │       ├── chain_data_cache.json
  │       └── mmcif_cache.json
  └── alignment_data
      └── alignments
          ├── 3lrm_A/
          ├── 3lrm_B/
          ├── 6kwc_A/
          └── ...
```
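
Before starting a long run, it can help to confirm that the alignment directories line up with the mmCIF files. The following is a minimal sketch and not part of OpenFold: the function name `find_unmatched_alignments` is hypothetical, and it assumes that alignment directories are named `<pdb_id>_<chain_id>` and mmCIF files `<pdb_id>.cif`, as in the example tree above.

```python
from pathlib import Path


def find_unmatched_alignments(mmcif_dir, alignments_dir):
    """Return alignment directory names that have no matching mmCIF file.

    Assumption (taken from the example layout): alignment directories are
    named <pdb_id>_<chain_id> and mmCIF files are named <pdb_id>.cif.
    """
    pdb_ids = {p.stem.lower() for p in Path(mmcif_dir).glob("*.cif")}
    unmatched = []
    for entry in sorted(Path(alignments_dir).iterdir()):
        if not entry.is_dir():
            continue
        # Split on the last underscore: "3lrm_A" -> "3lrm"
        pdb_id = entry.name.rsplit("_", 1)[0].lower()
        if pdb_id not in pdb_ids:
            unmatched.append(entry.name)
    return unmatched
```

An empty list means every alignment directory has a corresponding structure file under the naming assumption stated above.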

The basic command to train a new OpenFold model is: 

```
python3 train_openfold.py $DATA_DIR/pdb_data/mmcifs $DATA_DIR/alignment_data/alignments $TEMPLATE_MMCIF_DIR $OUTPUT_DIR \
    --max_template_date 2021-10-10 \
    --train_chain_data_cache_path $DATA_DIR/pdb_data/data_caches/chain_data_cache.json \
    --template_release_dates_cache_path $DATA_DIR/pdb_data/data_caches/mmcif_cache.json \
    --config_preset initial_training \
    --seed 42 \
    --obsolete_pdbs_file_path $DATA_DIR/pdb_data/obsolete.dat \
    --num_nodes 1 \
    --gpus 4 \
    --num_workers 4
```
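
When the command is launched from a script, assembling it as an argument list avoids shell line-continuation mistakes such as a stray space after a backslash. The following is a minimal sketch, not an OpenFold utility: `build_train_command` is a hypothetical helper, the paths follow the example directory tree above, and only flags already shown in this guide are used.

```python
import shlex


def build_train_command(data_dir, template_mmcif_dir, output_dir, gpus=4):
    """Assemble the basic training command as a list of string arguments."""
    pdb_data = f"{data_dir}/pdb_data"
    return [
        "python3", "train_openfold.py",
        f"{pdb_data}/mmcifs",
        f"{data_dir}/alignment_data/alignments",
        template_mmcif_dir,
        output_dir,
        "--max_template_date", "2021-10-10",
        "--train_chain_data_cache_path", f"{pdb_data}/data_caches/chain_data_cache.json",
        "--template_release_dates_cache_path", f"{pdb_data}/data_caches/mmcif_cache.json",
        "--config_preset", "initial_training",
        "--seed", "42",
        "--obsolete_pdbs_file_path", f"{pdb_data}/obsolete.dat",
        "--num_nodes", "1",
        "--gpus", str(gpus),
        "--num_workers", "4",
    ]


if __name__ == "__main__":
    # Placeholder paths; the command is printed for inspection, not executed.
    cmd = build_train_command("/data/openfold", "/data/templates", "/data/output")
    print(shlex.join(cmd))
```

The resulting list can be passed to `subprocess.run(cmd, check=True)` once the paths point at real data.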

The required arguments are:
- `mmcif_dir`: mmCIF files for the training set.
- `alignments_dir`: Alignments for the sequences in `mmcif_dir`; see the expected directory structure above.
- `template_mmcif_dir`: Template mmCIF files with structures, which can be the same directory as `mmcif_dir`. The `max_template_date` and `template_release_dates_cache_path` arguments determine which templates are allowed, based on a date cutoff.
- `output_dir`: Where model checkpoint files and other outputs will be saved.

Commonly used flags include:
- `config_preset`: Specifies which selection of hyperparameters should be used for initial model training. Commonly used configs are defined in [`openfold/config.py`](https://github.com/aqlaboratory/openfold/blob/main/openfold/config.py).
- `num_nodes` and `gpus`: Specify the number of nodes and GPUs available to train OpenFold.
- `seed`: Specifies the random seed.
- `num_workers`: Number of CPU workers to assign for creating dataset examples.
- `obsolete_pdbs_file_path`: Specifies obsolete PDB IDs that should be excluded from training.
- `val_data_dir` and `val_alignment_dir`: Specify the data directory and alignments for the validation dataset.

```{note}
`--seed` must be specified to correctly configure training examples on multi-GPU training runs.
```

#### Train with OpenFold Dataset Configuration

If the [OpenFold alignment database](OpenFold_Training_Setup.md#2-creating-alignment-dbs-optional) setup is used, resulting in a data directory such as:
```
- $DATA_DIR
  ├── duplicate_pdb_chains.txt
  ├── pdb_data
  │   └── mmcifs
  │       ├── 3lrm.cif
  │       └── 6kwc.cif
  └── alignment_data
      └── alignment_db
          ├── alignment_db_0.db
          ├── alignment_db_1.db
          ...
          ├── alignment_db_9.db
          └── alignment_db.index
```
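
A quick way to confirm this layout before training is to list the database shards and check for the index file. The following is a minimal sketch, not part of OpenFold: `summarize_alignment_db` is a hypothetical helper, and the file names it looks for (`alignment_db_*.db`, `alignment_db.index`) are taken from the example tree above.

```python
from pathlib import Path


def summarize_alignment_db(alignment_db_dir):
    """Return (sorted shard file names, whether alignment_db.index exists).

    File naming is assumed to match the example tree in this guide.
    """
    db_dir = Path(alignment_db_dir)
    shards = sorted(p.name for p in db_dir.glob("alignment_db_*.db"))
    has_index = (db_dir / "alignment_db.index").is_file()
    return shards, has_index
```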

With this layout, the training command uses the `--alignment_index_path` argument to point to the `alignment_db.index` file, e.g.:

```
python3 train_openfold.py $DATA_DIR/pdb_data/mmcifs $DATA_DIR/alignment_data/alignment_db $TEMPLATE_MMCIF_DIR $OUTPUT_DIR \
    --max_template_date 2021-10-10 \
    --train_chain_data_cache_path $DATA_DIR/pdb_data/data_caches/chain_data_cache.json \
    --template_release_dates_cache_path $DATA_DIR/pdb_data/data_caches/mmcif_cache.json \
    --alignment_index_path $DATA_DIR/alignment_data/alignment_db/alignment_db.index \
    --config_preset initial_training \
    --seed 42 \
    --obsolete_pdbs_file_path $DATA_DIR/pdb_data/obsolete.dat \
    --num_nodes 1 \
    --gpus 4 \
    --num_workers 4
```

#### Additional command line flag options:

Here we provide brief descriptions of the options for customizing your OpenFold training run. A full description of all flags is available through the script's `--help` option.

- **Use the DeepSpeed acceleration strategy:** `--deepspeed_config`. This option configures OpenFold to use custom DeepSpeed kernels. It requires a `deepspeed_config.json` file; you can create your own or use the one in the OpenFold directory.

- **Use a validation dataset:** Specify validation database paths with `--val_data_dir` and `--val_alignment_dir`. Validation metrics will be evaluated on these datasets.

- **Use a self-distillation dataset:** Specify paths with the `--distillation_data_dir` and `--distillation_alignment_dir` flags.

- **Change specific parameters in the model or data setup:** `--experiment_config_json`. These parameters must be defined in [`openfold/config.py`](https://github.com/aqlaboratory/openfold/blob/main/openfold/config.py). For example, to change the crop size for training a model, you can write the following JSON:
	```cropsize.json
	{
		"data.train.crop_size": 128
	}
	```

- **Configure training settings with PyTorch Lightning**
	
	Some flags, e.g. `--precision` and `--max_epochs`, configure training behavior. See the PyTorch Lightning Trainer args section in the `--help` menu for more information, and consult the [PyTorch Lightning documentation](https://lightning.ai/docs/pytorch/stable/).
	
	- Precision: On A100s, OpenFold training works best with bfloat16 precision (e.g. `--precision bf16-mixed`).
	
- **Restart training from an existing checkpoint:** Use the `--resume_from_ckpt` flag to restart training from an existing checkpoint.
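
The override file passed to `--experiment_config_json` can also be generated from a script, which is convenient when sweeping over several values. The following is a minimal sketch: `write_experiment_config` is a hypothetical helper, and the only key shown is the `data.train.crop_size` example from the list above.

```python
import json


def write_experiment_config(path, overrides):
    """Write a flat mapping of dotted config keys to a JSON file."""
    with open(path, "w") as handle:
        json.dump(overrides, handle, indent=4)
        handle.write("\n")
```

Calling `write_experiment_config("cropsize.json", {"data.train.crop_size": 128})` reproduces the crop-size file, which can then be supplied with `--experiment_config_json cropsize.json`.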

## Advanced Training Configurations 
:::

### Fine tuning from existing model weights 

If you have existing model weights, you can fine-tune the model by specifying a checkpoint path with the `--resume_from_ckpt` and `--resume_model_weights_only` arguments, e.g.:

```
python3 train_openfold.py $DATA_DIR/mmcifs $DATA_DIR/alignment.db $TEMPLATE_MMCIF_DIR $OUTPUT_DIR \
    --max_template_date 2021-10-10 \
    --train_chain_data_cache_path chain_data_cache.json \
    --template_release_dates_cache_path mmcif_cache.json \
    --config_preset finetuning \
    --alignment_index_path $DATA_DIR/pdb/alignment_db.index \
    --seed 4242022 \
    --obsolete_pdbs_file_path obsolete.dat \
    --num_nodes 1 \
    --gpus 4 \
    --num_workers 4 \
    --resume_from_ckpt $CHECKPOINT_PATH \
    --resume_model_weights_only
```

If you have model parameters from OpenFold v1.x, you may need to convert your checkpoint file or parameters. See [Converting OpenFold v1 Weights](convert_of_v1_weights.md) for more details.

### Using MPI

If MPI is configured on your system and you would like to use it to train OpenFold models, you may do so with the following steps:

1. Install the `mpi4py` package, which is available through pip and conda. Please see the [mpi4py documentation](https://pypi.org/project/mpi4py/) for installation instructions.
2. Add the `--mpi_plugin` flag to your training command.
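
Before adding `--mpi_plugin`, you can confirm that `mpi4py` is visible to the Python environment used for training. The following is a minimal sketch using only the standard library; `mpi4py_available` is a hypothetical helper name, and the check only confirms that the package can be located, not that the underlying MPI installation works.

```python
import importlib.util


def mpi4py_available():
    """Return True if the mpi4py package can be found in this environment."""
    return importlib.util.find_spec("mpi4py") is not None


if __name__ == "__main__":
    print("mpi4py found" if mpi4py_available() else "mpi4py not found")
```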


### Training Multimer models

```{note}
Coming soon.
```