In this guide, we will cover how to use OpenFold to make structure predictions.
## Background
We currently offer three modes of inference:
- Monomer
- Multimer
- Single Sequence (Soloseq)
This guide will focus on monomer prediction; the following sections describe [Multimer](Multimer_Inference.md) and [Single Sequence](Single_Sequence_Inference.md) prediction.
### Pre-requisites:
## Running AlphaFold Model Inference
The script [`run_pretrained_openfold.py`](https://github.com/aqlaboratory/openfold/blob/main/run_pretrained_openfold.py) performs model inference. We will go through the steps of how to use this script.
### Download Model Parameters
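If you have not downloaded parameters yet, the repository ships helper scripts for this. The script names and destination directory below are assumptions based on the repository layout; adjust the paths to your setup.

```bash
# Download the OpenFold-trained parameters (destination directory is an example)
bash scripts/download_openfold_params.sh openfold/resources

# Optionally, also fetch the original AlphaFold parameters
bash scripts/download_alphafold_params.sh openfold/resources
```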
### Model Inference
The input to [`run_pretrained_openfold.py`](https://github.com/aqlaboratory/openfold/blob/main/run_pretrained_openfold.py) is a directory of FASTA files. AlphaFold-style models also require a sequence alignment to perform inference.
If you do not have sequence alignments for your input sequences, you can compute them with the inference script directly by following the instructions in the section on [inference without pre-computed alignments](#model-inference-without-pre-computed-alignments).
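With pre-computed alignments in hand, a typical invocation looks roughly like the sketch below. The paths, preset, and output directory are placeholders, and the exact flag set varies by version, so consult `python run_pretrained_openfold.py --help` for the full list.

```bash
# fasta_dir/ holds the input FASTA files; the second positional argument
# points at the directory of template mmCIF structures.
python3 run_pretrained_openfold.py \
    fasta_dir/ \
    data/pdb_mmcif/mmcif_files/ \
    --use_precomputed_alignments alignments_dir/ \
    --config_preset model_1_ptm \
    --model_device "cuda:0" \
    --output_dir ./predictions
```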
Multimer inference can also run with the older database versions if desired.
## Running Multimer Inference
The [`run_pretrained_openfold.py`](https://github.com/aqlaboratory/openfold/blob/main/run_pretrained_openfold.py) script can be used to run multimer inference with the following command.
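The exact flags depend on where your genetic databases live. The sketch below uses placeholder paths, assumes the AlphaFold-Multimer database layout (UniProt and PDB seqres), and the flag names may differ in your version, so check the script's `--help` output.

```bash
# Multimer prediction uses a multimer config preset and additionally needs
# the UniProt and PDB seqres databases for MSA pairing and templates.
python3 run_pretrained_openfold.py \
    fasta_dir/ \
    data/pdb_mmcif/mmcif_files/ \
    --config_preset model_1_multimer_v3 \
    --model_device "cuda:0" \
    --uniprot_database_path data/uniprot/uniprot.fasta \
    --pdb_seqres_database_path data/pdb_seqres/pdb_seqres.txt \
    --output_dir ./predictions
```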
# Setting up the OpenFold PDB training set from RODA
The multiple sequence alignments of OpenProteinSet and mmCIF structure files required to train OpenFold are freely available at the [Registry of Open Data on AWS (RODA)](https://registry.opendata.aws/openfold/). Additionally, OpenFold requires some postprocessing and [auxiliary files](Aux_seq_files.md) for training that need to be generated from the AWS data manually. This documentation is intended to give a full overview of those steps starting from the data download.
### Pre-Requisites:
- OpenFold conda environment. See [OpenFold Installation](Installation.md) for instructions on how to build this environment.
- For this guide, we assume that the OpenFold codebase is located at `$OF_DIR`.
## 1. Downloading alignments and structure files
To fetch all the alignments corresponding to the original PDB training set of OpenFold alongside their mmCIF 3D structures, you can run the following commands:
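For example, the alignments can be pulled with the AWS CLI (no AWS account is needed). The bucket prefix below is an assumption; check the [RODA listing](https://registry.opendata.aws/openfold/) for the current layout. The mmCIF structure files can be fetched in the same way from a PDB snapshot mirror.

```bash
# Pull the precomputed PDB-chain alignments from the public OpenFold bucket
# into the directory that the later steps of this guide expect.
mkdir -p alignment_data/alignments
aws s3 cp --no-sign-request --recursive \
    s3://openfold/pdb/ alignment_data/alignments/
```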
As an optional check, the following command should return 634,434:
```bash
ls alignment_data/alignments/ | wc -l
```
## 4. Generating cluster-files
The AlphaFold dataloader adjusts the sampling probability of chains by their inverse cluster size, so we need to generate these sequence clusters for our training set.
As a first step, we'll need a `.fasta` file of all sequences in the training set. This can be generated with the following scripts, depending on how you set up your alignment data in the previous steps:
Note: these scripts replace `data_dir_to_fasta.py`, which is slow because it re-parses every mmCIF structure file. Instead, they read the sequence from the `>query` line of each chain's MSA, which is much faster. As a result, the generated `.fasta` may not exactly mirror the mmCIF files in the PDB directory if MSA generation failed for some structures; since OpenFold only trains on chains with alignments available, this is fine in practice.
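For illustration, if your alignments are stored as raw per-chain directories, a minimal sketch along the following lines can assemble the FASTA. The directory layout and file names are assumptions; use the dedicated scripts below for the `alignment_db` case.

```bash
# Assumed layout: alignment_data/alignments/<pdb_id>_<chain>/<name>.a3m
# The first record of each a3m is the query, so we take its header line and
# the sequence on the next line. If your a3m files carry an extra header
# line, adjust the NR value accordingly.
out=train_set.fasta
: > "$out"
for chain_dir in alignment_data/alignments/*/; do
    chain_id=$(basename "$chain_dir")
    a3m=$(ls "$chain_dir"*.a3m 2>/dev/null | head -n 1)
    [ -z "$a3m" ] && continue
    printf '>%s\n' "$chain_id" >> "$out"
    awk 'NR==2 { print; exit }' "$a3m" >> "$out"
done
```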
**Use this if you set up the `alignment_db` files:**
Next, we need to generate a cluster file at 40% sequence identity, which will contain all chains in a particular cluster on the same line. You'll need [MMSeqs2](https://github.com/soedinglab/MMseqs2?tab=readme-ov-file#installation) for this as well, which can be set up either in a conda environment or as a binary.
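With MMSeqs2 installed, a clustering run at 40% identity and the conversion to a one-cluster-per-line file could look like the sketch below. The output file name and any extra coverage flags needed to match AlphaFold's clustering exactly are assumptions; tune them as required.

```bash
# Cluster all training-set sequences at 40% sequence identity.
# MMSeqs2 writes <prefix>_cluster.tsv with (representative, member) pairs.
mmseqs easy-cluster train_set.fasta clusters tmp/ --min-seq-id 0.4

# Collapse the TSV so that each line lists every chain in one cluster.
awk -F'\t' '{ members[$1] = members[$1] " " $2 }
            END { for (rep in members) print substr(members[rep], 2) }' \
    clusters_cluster.tsv > clusters_by_40_seq_id.txt
```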
As a last step, OpenFold requires ["cache" files](Aux_seq_files.md#chain-cache-files-and-mmcif-cache-files) with metadata information for each chain that are used for choosing templates and samples during training.
The mmCIF-cache is used for filtering templates and can be generated with the following script:
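In the OpenFold repository this is handled by `scripts/generate_mmcif_cache.py`; a typical call is sketched below. The worker count and paths are examples, and the flags may differ slightly between versions.

```bash
# Build a JSON cache of per-entry metadata (release dates, resolution, chains)
# from the mmCIF directory; this is later used for template filtering.
python scripts/generate_mmcif_cache.py \
    pdb_data/mmcif_files/ \
    mmcif_cache.json \
    --no_workers 16
```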
To run inference for a sequence using the SoloSeq single-sequence model, you can either precompute ESM-1b embeddings in bulk, or you can generate them during inference.
For generating ESM-1b embeddings in bulk, use the provided script: [`scripts/precompute_embeddings.py`](https://github.com/aqlaboratory/openfold/blob/main/scripts/precompute_embeddings.py). The script takes a directory of FASTA files (one sequence per file) and generates ESM-1b embeddings in the same format and directory structure as required by SoloSeq. Following is an example command to use the script:
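A call along the following lines should work; the argument order is an assumption based on the script's usage, so check `python scripts/precompute_embeddings.py --help` if it differs in your version.

```bash
# Reads every FASTA file in fasta_dir/ (one sequence per file) and writes the
# corresponding ESM-1b embeddings into embeddings_dir/ in SoloSeq's layout.
python scripts/precompute_embeddings.py \
    fasta_dir/ \
    embeddings_dir/
```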
This guide requires the following:
- [Installation of OpenFold and dependencies](Installation.md) (including the jackhmmer and hhblits dependencies)
- A preprocessed dataset:
  - For this guide, we will use the original OpenFold dataset, which is available on RODA. This dataset can be downloaded with the following command:
    `./scripts/download_roda_dbs.sh <dst_path>`
  - If you wish to construct your own dataset, [these instructions](OpenFold_Training_Setup.md) provide guidance for preprocessing alignments into an OpenFold format.
- GPUs configured with CUDA. Training OpenFold with CPUs only is not supported.
Expected data directory structure:
```
- OpenProteinSet
    └── alignments
```
Adjust the number of data workers used to prepare data with the `--num_workers` setting. Increasing the number can help with dataset processing speed; however, too many workers can cause an out-of-memory (OOM) error.
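For example, a training invocation that raises the worker count might look roughly like the following; the positional arguments and other flags are placeholders and vary by version and dataset layout.

```bash
# The fifth positional argument (2021-10-10 here) is the maximum template
# release date; adjust it to your training cutoff.
python3 train_openfold.py \
    mmcif_dir/ alignment_dir/ template_mmcif_dir/ output_dir/ \
    2021-10-10 \
    --config_preset initial_training \
    --num_workers 8 \
    --seed 42
```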
When I reload my pretrained model weights or checkpoints, I get `RuntimeError: Error(s) in loading state_dict for OpenFoldWrapper: Unexpected key(s) in state_dict:`
This suggests that your checkpoint / model weights are in OpenFold v1 format with outdated model layer names. Convert your weights/checkpoints following [this guide](convert_of_v1_weights.md).
As part of the [OpenFold v2 update](https://github.com/aqlaboratory/openfold/releases/tag/v2.0.0) with the integration of multimer prediction, certain layers of the AlphaFold model were renamed (for example, `module.model.template_angle_embedder.*`).
If you have some checkpoints that were trained using OpenFold v1 or older, and now want to resume training on OpenFold v2, you may need to convert your checkpoints.
## FAQ
### Do I need to convert my checkpoints / model weights?
If you want to run inference or resume training from a checkpoint that was trained with OpenFold V1, you will need to convert your checkpoint.
If you want to load model weights only, without resuming from a specific training step, then you should not need to convert your checkpoints; training will begin from `step=0` in this case. To do so, you'll need both the `--resume_from_ckpt` and `--resume_model_weights_only` flags. This example allows you to train starting from the pre-trained OpenFold weights:
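A sketch of such a command is below. The checkpoint path is a placeholder, and depending on the version the boolean flag may need an explicit `True` value.

```bash
# Load only the model weights from a pre-trained checkpoint and start
# training from step 0 (no optimizer or scheduler state is restored).
python3 train_openfold.py \
    mmcif_dir/ alignment_dir/ template_mmcif_dir/ output_dir/ \
    2021-10-10 \
    --config_preset initial_training \
    --resume_from_ckpt openfold/resources/openfold_params/finetuning_ptm_2.pt \
    --resume_model_weights_only
```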
**Pre-requisites**
This package is currently supported for CUDA 11 and Pytorch 1.12. All dependencies are listed in the [`environment.yml`](https://github.com/aqlaboratory/openfold/blob/main/environment.yml).