_Figure: Comparison of OpenFold and AlphaFold2 predictions to the experimental structure of PDB 7KDX, chain B._
# OpenFold
A faithful but trainable PyTorch reproduction of DeepMind's [AlphaFold 2](https://github.com/deepmind/alphafold).
# Documentation
See our new home for docs at [openfold.readthedocs.io](https://openfold.readthedocs.io/en/latest/), with instructions for installation and model inference/training.
Much of the content from this page may be found [here](https://github.com/aqlaboratory/openfold/blob/main/docs/source/original_readme.md).
## Copyright Notice
While AlphaFold's and, by extension, OpenFold's source code is licensed under the permissive Apache License, Version 2.0, DeepMind's pretrained parameters fall under the more restrictive CC BY 4.0 license.
The multiple sequence alignments of OpenProteinSet and the mmCIF structure files required for training can be downloaded as described below.
### Pre-Requisites:
- OpenFold conda environment. See [OpenFold Installation](Installation.md) for instructions on how to build this environment.
- In particular, the [AWS CLI](https://aws.amazon.com/cli/) is used to download data from RODA.
- For this guide, we assume that the OpenFold codebase is located at `$OF_DIR`.
## 1. Downloading alignments and structure files
To fetch all the alignments corresponding to the original PDB training set of OpenFold alongside their mmCIF 3D structures, you can run the following commands:
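The commands were not reproduced here; a sketch using the AWS CLI might look like the following. The S3 prefixes under the RODA `s3://openfold` bucket are assumptions; verify them against the RODA listing before running.

```shell
# Download OpenProteinSet alignments for the PDB training set
# (bucket prefixes are assumptions; confirm against the RODA listing).
aws s3 cp s3://openfold/pdb "$DATA_DIR/alignment_data/pdb" \
    --recursive --no-sign-request

# Download the corresponding mmCIF structure files.
aws s3 cp s3://openfold/pdb_mmcifs "$DATA_DIR/pdb_data/mmcifs" \
    --recursive --no-sign-request
```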
The nested alignment directory structure is not yet exactly what OpenFold expects, so you can run the `flatten_roda.sh` script to convert them to the correct format:
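An invocation sketch follows; the argument order is an assumption, so check the script source or its usage message first.

```shell
# Flatten the nested RODA alignment directories into the per-chain
# layout OpenFold expects (paths are illustrative).
bash scripts/flatten_roda.sh \
    "$DATA_DIR/alignment_data/pdb" \
    "$DATA_DIR/alignment_data/alignments"
```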
As a last step, OpenFold requires ["cache" files](Aux_seq_files.md#chain-cache-files-and-mmcif-cache-files) with metadata information for each chain that are used for choosing templates and samples during training.
The mmCIF cache is used for filtering templates and can be generated with the following script:
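A sketch of generating the cache with the `scripts/generate_mmcif_cache.py` script from the OpenFold repository (the output path and worker flag are assumptions; check the script's `--help`):

```shell
# Build mmcif_cache.json with per-structure metadata used for
# template filtering (paths and worker count are illustrative).
python3 scripts/generate_mmcif_cache.py \
    "$DATA_DIR/pdb_data/mmcifs" \
    "$DATA_DIR/pdb_data/data_caches/mmcif_cache.json" \
    --no_workers 16
```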
The data caches for OpenProteinSet can be downloaded from RODA with the following:
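The download command was not reproduced here; a sketch, assuming the caches live under a `data_caches` prefix in the RODA bucket (verify the prefix against the RODA listing):

```shell
# Fetch the precomputed caches instead of generating them locally
# (the S3 prefix is an assumption; confirm before running).
aws s3 cp s3://openfold/data_caches "$DATA_DIR/pdb_data/data_caches" \
    --recursive --no-sign-request
```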
This guide covers how to train an OpenFold model for monomers. Additional instructions are provided at the end for training a multimer model and for fine-tuning from existing model weights.
### Pre-requisites:
This guide requires the following:
- [Installation of OpenFold and dependencies](Installation.md), including the jackhmmer and hhblits dependencies
- A preprocessed dataset:
- If you wish to construct your own dataset, [these instructions](OpenFold_Training_Setup.md) provide guidance for preprocessing alignments into an OpenFold format.
- For this guide, we will use the original OpenFold dataset which is available on RODA, processed with [these instructions](OpenFold_Training_Setup.md).
- GPUs configured with CUDA. Training OpenFold with CPUs only is not supported.
Expected data directory structure, assuming the default alignment file layout:
```
$DATA_DIR
├── pdb_data
│   ├── mmcifs
│   │   ├── 3lrm.cif
│   │   ├── 3u8d.cif
│   │   └── 6kwc.cif
│   ├── obsolete.dat
│   ├── duplicate_pdb_chains.txt
│   └── data_caches
│       ├── mmcif_cache.json
│       └── chain_data_cache.json
└── alignment_data
    └── alignments
        ├── 3lrm_A
        │   ├── mgnify_hits.a3m
        │   ├── bfd_uniclust_hits.a3m
        │   ├── uniref90_hits.a3m
        │   └── pdb70_hits.hhr
        ├── 3lrm_B
        └── 6kwc_A
```
The `mmcif_cache.json` and `chain_data_cache.json` files provide metadata for the mmCIF files and the protein chains in the dataset.
## Training a new OpenFold model
#### Basic command
The basic command to train a new OpenFold model is:
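The command itself was not reproduced here; the following is a sketch of what it might look like. The positional argument order (`mmcif_dir`, `alignment_dir`, `template_mmcif_dir`, `output_dir`) and the exact paths are assumptions inferred from the flag descriptions below; verify against `python3 train_openfold.py --help` before running.

```shell
# Illustrative sketch only: argument order, paths, and dates are
# placeholders; check `python3 train_openfold.py --help`.
python3 train_openfold.py \
    "$DATA_DIR/pdb_data/mmcifs" \
    "$DATA_DIR/alignment_data/alignments" \
    "$DATA_DIR/pdb_data/mmcifs" \
    "$OUTPUT_DIR" \
    --max_template_date 2021-10-10 \
    --template_release_dates_cache_path "$DATA_DIR/pdb_data/data_caches/mmcif_cache.json" \
    --config_preset initial_training \
    --seed 42 \
    --num_nodes 1 \
    --gpus 4 \
    --num_workers 8
```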
- `alignment_dir`: Alignments for the sequences in `mmcif_dir`; see the expected directory structure above
- `template_mmcif_dir`: Template mmCIF files with structures, which can be the same directory as `mmcif_dir`. The `max_template_date` and `template_release_dates_cache_path` flags specify which templates are allowed based on a date cutoff
- `output_dir`: Where model checkpoint files and other outputs will be saved
Commonly used flags include:
- `config_preset`: Specifies which selection of hyperparameters should be used for initial model training. Commonly used configs are defined in [`openfold/config.py`](https://github.com/aqlaboratory/openfold/blob/main/openfold/config.py)
- `num_nodes` and `gpus`: Specify the number of nodes and GPUs available for training OpenFold
- `seed`: Specifies the random seed
- `num_workers`: Number of CPU workers to assign for creating dataset examples
Note that `--seed` must be specified to correctly configure training examples on multi-GPU training runs.
#### Train OpenFold with Different Dataset Configurations
If the [OpenFold alignment database](OpenFold_Training_Setup.md#2-creating-alignment-dbs-optional) setup is used, resulting in a data directory such as:
```
$DATA_DIR
├── duplicate_pdb_chains.txt
├── pdb_data
│   └── mmcifs
│       ├── 3lrm.cif
│       └── 6kwc.cif
└── alignment_data
    └── alignment_db
        ├── alignment_db_0.db
        ├── alignment_db_1.db
        ...
        ├── alignment_db_9.db
        └── alignment_db.index
```
The training command will use the `alignment_index_path` argument to specify `db.index` files, e.g.:
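The example command was not reproduced here; a sketch of what the alignment-database variant might look like, assuming the same positional arguments as the basic command (paths and flags are assumptions; verify against `python3 train_openfold.py --help`):

```shell
# Same basic invocation, but pointing the alignment path at the packed
# databases and passing their index file (paths are illustrative).
python3 train_openfold.py \
    "$DATA_DIR/pdb_data/mmcifs" \
    "$DATA_DIR/alignment_data/alignment_db" \
    "$DATA_DIR/pdb_data/mmcifs" \
    "$OUTPUT_DIR" \
    --max_template_date 2021-10-10 \
    --config_preset initial_training \
    --alignment_index_path "$DATA_DIR/alignment_data/alignment_db/alignment_db.index" \
    --seed 42
```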
Here we provide brief descriptions for customizing your training run of OpenFold:
- **Restart training from an existing checkpoint:** Use the `--resume_from_ckpt` flag to restart training from an existing checkpoint.
## Advanced Training Configurations
### Training OpenFold Multimer
At this time, we do not have a multimer training set available. To prepare your own multimer training set, please see the instructions at [Data Processing - multimer]
The basic command for training a multimer model is then:
```
multimer training command here
```
The key differences are:
- Dataset configuration / preparation
### Fine-tuning from existing model weights
If you have existing model weights, you can fine-tune the model by specifying a checkpoint path with the `--resume_from_ckpt` and `--resume_model_weights_only` arguments, e.g.:
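A sketch of what such a fine-tuning invocation might look like, assuming the same positional arguments as the basic training command (the checkpoint path, config preset name, and flag usage are assumptions; verify against `python3 train_openfold.py --help`):

```shell
# Load only the model weights from an existing checkpoint and continue
# training (checkpoint path and preset name are placeholders).
python3 train_openfold.py \
    "$DATA_DIR/pdb_data/mmcifs" \
    "$DATA_DIR/alignment_data/alignments" \
    "$DATA_DIR/pdb_data/mmcifs" \
    "$OUTPUT_DIR" \
    --max_template_date 2021-10-10 \
    --config_preset finetuning \
    --resume_from_ckpt "$CKPT_PATH" \
    --resume_model_weights_only
```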
If you have model parameters from OpenFold v1.x, you may need to convert your checkpoint file or parameters. See [Converting OpenFold v1 Weights](convert_of_v1_weights.md) for more details.
### Using MPI
If MPI is configured on your system and you would like to use MPI to train OpenFold:
1. Add the `mpi4py` package, which is available through pip and conda. Please see the [mpi4py documentation](https://pypi.org/project/mpi4py/) for more instructions on installation.
2. Add the `--mpi_plugin` flag to your training command.
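The two steps above might be combined as follows. This is a sketch, not a tested launch line: the `mpirun` process count and the training arguments are placeholders, and your cluster's MPI launcher (`mpirun`, `mpiexec`, `srun`) may differ.

```shell
# Launch training under MPI with the plugin enabled (process count,
# paths, and flags are illustrative placeholders).
mpirun -np 8 python3 train_openfold.py \
    "$DATA_DIR/pdb_data/mmcifs" \
    "$DATA_DIR/alignment_data/alignments" \
    "$DATA_DIR/pdb_data/mmcifs" \
    "$OUTPUT_DIR" \
    --config_preset initial_training \
    --mpi_plugin
```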
- [Get started with OpenFold](Installation.md)
- Learn how to [run inference with OpenFold](Inference.md)
- [Train your own OpenFold models](Training_OpenFold.md)
- Find guidance for setup and running OpenFold in the [FAQ](FAQ.md).
We also have a [Colab notebook](https://colab.research.google.com/github/aqlaboratory/openfold/blob/main/notebooks/OpenFold.ipynb) that can be used for single structure / multimer prediction.
Some portions of the documentation are still under migration from the original README, which can be found [here](original_readme.md).