"vscode:/vscode.git/clone" did not exist on "fb02f39ad8736da962951ecf54658dd1881b902f"
Training_OpenFold.md 7.12 KB
Newer Older
1
2
3
4
5
6
7
8
# Training OpenFold
## Background

This guide covers how to train an OpenFold model. These instructions focus on training a model for predicting monomers; additional instructions for training a multimer model are provided below.

### Prerequisites:

This guide requires the following:
- [Installation of OpenFold and dependencies](Installation.md) (including the jackhmmer and hhblits dependencies)
- A preprocessed dataset:
	- For this guide, we will use the original OpenFold dataset, which is available on RODA (TODO: add link to processed dataset).
	- If you wish to construct your own dataset, [these instructions](OpenFold_Training_Setup.md) provide guidance for preprocessing alignments into an OpenFold format. 
- GPUs configured with CUDA. CPU-only training of OpenFold is not supported.

Expected data directory structure:
```
OpenProteinSet
├── alignments
│   ├── 2x7l_M
│   │   ├── mgnify_hits.a3m
│   │   ├── bfd_uniclust_hits.a3m
│   │   ├── uniref90_hits.a3m
│   │   └── pdb70_hits.hhr
│   └── ...
├── mmcifs
│   ├── 3u8d.cif
│   ├── 3lrm.cif
│   └── ...
├── mmcif_cache.json
└── chain_data_cache.json
```

The `mmcif_cache.json` and `chain_data_cache.json` files provide metadata for the mmCIF files and the protein chains in the dataset.
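
If your copy of the dataset does not already include these caches, the OpenFold repository ships generator scripts. A minimal sketch, assuming the script names below; verify the exact arguments with each script's `--help`:

```
python3 scripts/generate_mmcif_cache.py mmcifs/ mmcif_cache.json --no_workers 16
python3 scripts/generate_chain_data_cache.py mmcifs/ chain_data_cache.json --no_workers 16
```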

## Training a new OpenFold model 

#### Basic command
The basic command to train a new OpenFold model is:
```
python3 train_openfold.py $DATA_DIR/mmcifs/ $DATA_DIR/alignments/ template_mmcif_dir/ $OUTPUT_DIR \
    --max_template_date 2021-10-10 \
    --train_chain_data_cache_path chain_data_cache.json \
    --template_release_dates_cache_path mmcif_cache.json \
    --config_preset initial_training \
    --seed 42 \
    --obsolete_pdbs_file_path obsolete.dat \
    --num_nodes 1 \
    --gpus 4 \
    --num_workers 4
```

The required arguments are:
- `mmcif_dir`: mmCIF files for the training set.
- `alignment_dir`: Alignments for the sequences in `mmcif_dir`; see the expected directory structure above.
- `template_mmcif_dir`: Template mmCIF structure files; this can be the same directory as `mmcif_dir`. The `max_template_date` and `template_release_dates_cache_path` arguments specify which templates are allowed based on a date cutoff.
- `$OUTPUT_DIR`: Where model checkpoints and other outputs will be saved.

Commonly used flags include:
- `config_preset`: Specifies which selection of hyperparameters should be used for initial model training. Commonly used configs are defined in `openfold/config.py`.
- `num_nodes` and `gpus`: Specify the number of nodes and GPUs available for training OpenFold.
- `seed`: Specifies the random seed.
- `num_workers`: Number of CPU workers to assign for creating dataset examples.
- `obsolete_pdbs_file_path`: Specifies obsolete PDB IDs that should be excluded from training.
- `val_data_dir` and `val_alignment_dir`: Specify the data directory and alignments for a validation dataset.

```{note}
`--seed` must be specified to correctly configure training examples for multi-GPU training runs.
```



#### Train OpenFold with Different Dataset Configurations

If the [OpenFold alignment database](OpenFold_Training_Setup.md#2-creating-alignment-dbs-optional) setup is used, the training command changes as sketched below.
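
This is a minimal sketch, assuming that `train_openfold.py` accepts an `--alignment_index_path` flag pointing at the index produced during alignment database creation; the index filename `alignment_db.index` is a placeholder, so verify the flag name and your index path with `--help`:

```
python3 train_openfold.py $DATA_DIR/mmcifs/ $DATA_DIR/alignment_dbs/ template_mmcif_dir/ $OUTPUT_DIR \
    --max_template_date 2021-10-10 \
    --train_chain_data_cache_path chain_data_cache.json \
    --template_release_dates_cache_path mmcif_cache.json \
    --config_preset initial_training \
    --alignment_index_path $DATA_DIR/alignment_dbs/alignment_db.index \
    --seed 42 \
    --num_nodes 1 \
    --gpus 4 \
    --num_workers 4
```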

#### Additional command line flag options:

Here we provide brief descriptions of options for customizing your training run of OpenFold. A full description of all flags is available via the script's `--help` option.

- **Use the DeepSpeed acceleration strategy:** `--deepspeed_config`. This option configures OpenFold to use custom DeepSpeed kernels. It requires a `deepspeed_config.json`; you can create your own or use the one in the OpenFold directory.

- **Use a validation dataset:** Specify validation dataset paths with `--val_data_dir` and `--val_alignment_dir`. Validation metrics will be evaluated on these datasets.

- **Use a self-distillation dataset:** Specify paths with the `--distillation_data_dir` and `--distillation_alignment_dir` flags.

- **Change specific parameters in the model or data setup:** `--experiment_config_json`. These parameters must be defined in [`openfold/config.py`](https://github.com/aqlaboratory/openfold/blob/main/openfold/config.py). For example, to change the crop size for training a model, you can write the following JSON and pass it with `--experiment_config_json cropsize.json` (see the combined sketch after this list):
	```cropsize.json
	{
	    "data.train.crop_size": 128
	}
	```

- **Configure training settings with PyTorch Lightning**

	Some flags, e.g. `--precision` and `--max_epochs`, configure training behavior. See the PyTorch Lightning Trainer args section of the `--help` menu for more information, and consult the [PyTorch Lightning documentation](https://lightning.ai/docs/pytorch/stable/).

	- Precision: On A100s, OpenFold training works best with bfloat16 precision (e.g. `--precision bf16-mixed`).
	
- **Restart training from an existing checkpoint:** Use the `--resume_from_ckpt` flag to restart training from an existing checkpoint.
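
Putting several of these options together, here is a hedged sketch of a run using DeepSpeed, a validation set, a custom experiment config, and bf16 precision. Flag names are as listed above and all paths are placeholders; confirm each flag with `--help`:

```
python3 train_openfold.py $DATA_DIR/mmcifs/ $DATA_DIR/alignments/ template_mmcif_dir/ $OUTPUT_DIR \
    --max_template_date 2021-10-10 \
    --config_preset initial_training \
    --deepspeed_config deepspeed_config.json \
    --val_data_dir $VAL_DIR/mmcifs/ \
    --val_alignment_dir $VAL_DIR/alignments/ \
    --experiment_config_json cropsize.json \
    --precision bf16-mixed \
    --seed 42 \
    --gpus 4 \
    --num_workers 4
```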

## Advanced Training Configurations 

### Training OpenFold Multimer 

At this time, we do not have a multimer training set available. To prepare your own multimer training set, please see the instructions at [Data Processing - multimer].

The basic command for training a multimer model mirrors the monomer command; a hedged sketch is shown below. The key differences from monomer training lie in the dataset configuration and preparation.
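
This sketch assumes that the flag names carry over from the monomer command and that a multimer training preset exists in `openfold/config.py`; the preset name `initial_training_multimer` is a hypothetical placeholder, so check the config file for the actual name:

```
python3 train_openfold.py $DATA_DIR/mmcifs/ $DATA_DIR/alignments/ template_mmcif_dir/ $OUTPUT_DIR \
    --max_template_date 2021-10-10 \
    --config_preset initial_training_multimer \
    --seed 42 \
    --num_nodes 1 \
    --gpus 4 \
    --num_workers 4
```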

### Fine-tuning from existing model weights

If you have existing model weights, you can fine-tune the model using the following command:

```
python3 train_openfold.py mmcif_dir/ alignment_dir/ template_mmcif_dir/ $OUTPUT_DIR \
    --max_template_date 2021-10-10 \
    --train_chain_data_cache_path chain_data_cache.json \
    --template_release_dates_cache_path mmcif_cache.json \
    --config_preset finetuning \
    --seed 4242022 \
    --obsolete_pdbs_file_path obsolete.dat \
    --num_nodes 1 \
    --gpus 4 \
    --num_workers 4 \
    --resume_from_ckpt $CHECKPOINT_PATH \
    --resume_model_weights_only
```

If you have model parameters from OpenFold v1.x, you may need to convert your checkpoint or parameter files. See [Converting OpenFold v1 Weights](convert_of_v1_weights.md).

### Using MPI

If MPI is configured on your system and you would like to use it to train OpenFold models, you may do so with the following steps:

1. Install the `mpi4py` package, which is available through pip and conda. Please see the [mpi4py documentation](https://pypi.org/project/mpi4py/) for installation instructions.
2. Add the `--mpi_plugin` flag to your training command.
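
As an illustration, a minimal sketch of an MPI launch, assuming two nodes with four GPUs each and one rank per GPU; the exact `mpirun` invocation depends on your MPI distribution and scheduler:

```
pip install mpi4py
mpirun -np 8 python3 train_openfold.py $DATA_DIR/mmcifs/ $DATA_DIR/alignments/ template_mmcif_dir/ $OUTPUT_DIR \
    --config_preset initial_training \
    --num_nodes 2 \
    --gpus 4 \
    --mpi_plugin
```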


## Troubleshooting FAQ

**My model training is hanging on the data loading step.**

While each system is different, here are a few general suggestions:
- Check your `$KMP_AFFINITY` environment setting.
- Adjust the number of data workers used to prepare data with the `--num_workers` setting. Increasing the number can speed up dataset processing, but too many workers can cause an out-of-memory (OOM) issue.
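
For example, one hedged first step is to disable OpenMP thread pinning and lower the worker count; `KMP_AFFINITY=none` is a standard Intel OpenMP setting, but whether it helps depends on your system:

```
export KMP_AFFINITY=none
python3 train_openfold.py ... --num_workers 2
```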

**When I reload my pretrained model weights or checkpoints, I get `RuntimeError: Error(s) in loading state_dict for OpenFoldWrapper: Unexpected key(s) in state_dict:`**

This suggests that your checkpoint or model weights are in the OpenFold v1 format with outdated model layer names. Convert your weights/checkpoints following [this guide](convert_of_v1_weights.md).