- [Building and Using the Docker Container](#building-and-using-the-docker-container)
- [Copyright Notice](#copyright-notice)
- [Contributing](#contributing)
- [Citing this Work](#citing-this-work)
## Features
OpenFold carefully reproduces (almost) all of the features of the original open
...
@@ -63,7 +81,7 @@ To install:
For some systems, it may help to append the Conda environment library path to `$LD_LIBRARY_PATH`. The `install_third_party_dependencies.sh` script does this once, but you may need this for each bash instance.
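As a minimal sketch of that step, the function below appends the active Conda environment's `lib` directory to `LD_LIBRARY_PATH`. It assumes the environment is already activated, so that Conda's `CONDA_PREFIX` variable is set. The function name `add_conda_lib_path` is hypothetical and not part of OpenFold.

```bash
# Hypothetical helper: append the active Conda environment's lib directory
# to LD_LIBRARY_PATH, skipping the update if it is already listed.
add_conda_lib_path() {
  # CONDA_PREFIX is set by `conda activate`; bail out if no environment is active.
  [ -n "${CONDA_PREFIX:-}" ] || return 1
  local lib_dir="${CONDA_PREFIX}/lib"
  case ":${LD_LIBRARY_PATH:-}:" in
    *":${lib_dir}:"*) ;;  # already present, nothing to do
    *) export LD_LIBRARY_PATH="${LD_LIBRARY_PATH:+${LD_LIBRARY_PATH}:}${lib_dir}" ;;
  esac
}

add_conda_lib_path || echo "No active Conda environment; LD_LIBRARY_PATH unchanged." >&2
```

Placing the function and its call in `~/.bashrc` (after the Conda initialization block) would apply it to each new bash instance.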
## Usage
## Download Alignment Databases
If you intend to generate your own alignments, e.g. for inference, you have two
choices for downloading protein databases, depending on whether you want to use
...
@@ -112,7 +130,16 @@ DeepMind's pretrained parameters, you will only be able to make changes that
do not affect the shapes of model parameters. For an example of initializing
the model, consult `run_pretrained_openfold.py`.
### Inference
## Inference
OpenFold now supports three inference modes:
- [Monomer Inference](#monomer-inference): OpenFold reproduction of AlphaFold2. Inference available with either DeepMind's pretrained parameters or OpenFold trained parameters.
- [Multimer Inference](#multimer-inference): OpenFold reproduction of AlphaFold-Multimer. Inference available with DeepMind's pretrained parameters.
- [Single Sequence Inference (SoloSeq)](#soloseq-inference): Language Model based structure prediction, using [ESM-1b](https://github.com/facebookresearch/esm) embeddings.
More instructions for each inference mode are provided below:
### Monomer Inference
To run inference on a sequence or multiple sequences using a set of DeepMind's
pretrained parameters, first download the OpenFold weights e.g.:
...
@@ -219,7 +246,7 @@ this case, the inference script runs AlphaFold-Gap, a hack proposed
[here](https://twitter.com/minkbaek/status/1417538291709071362?lang=en), using
the specified stock AlphaFold/OpenFold parameters (NOT AlphaFold-Multimer).
#### Multimer Inference
### Multimer Inference
To run inference on a complex or multiple complexes using a set of DeepMind's pretrained parameters, run e.g.:
...
@@ -247,7 +274,8 @@ As with monomer inference, if you've already computed alignments for the query,
the `--use_precomputed_alignments` option. Note that template searching in the multimer pipeline
uses HMMSearch with the PDB SeqRes database, replacing HHSearch and PDB70 used in the monomer pipeline.
##### Upgrades
**Upgrade from an existing OpenFold installation**
The above command requires several upgrades to existing OpenFold installations.
1. Re-download the AlphaFold parameters to get the latest
2. Download the [UniProt](https://www.uniprot.org/uniprotkb/)
and [PDB SeqRes](https://www.rcsb.org/) databases:
```bash
bash scripts/download_uniprot.sh data/
```
The PDB SeqRes and PDB databases must be from the same date to avoid potential
errors during template searching. Remove the existing `data/pdb_mmcif` directory
...
@@ -271,7 +299,7 @@ and [PDB SeqRes](https://www.rcsb.org/) databases:
```bash
bash scripts/download_pdb_mmcif.sh data/
bash scripts/download_pdb_seqres.sh data/
```
3. Additionally, AlphaFold-Multimer uses upgraded versions of the [MGnify](https://www.ebi.ac.uk/metagenomics)
and [UniRef30](https://uniclust.mmseqs.com/) (previously UniClust30) databases. To download the upgraded databases, run:
...
@@ -279,10 +307,12 @@ and [UniRef30](https://uniclust.mmseqs.com/) (previously UniClust30) databases.
```bash
bash scripts/download_uniref30.sh data/
bash scripts/download_mgnify.sh data/
```
Multimer inference can also run with the older database versions if desired.
#### SoloSeq Inference
### SoloSeq Inference
To run inference for a sequence using the SoloSeq single-sequence model, you can either precompute ESM-1b embeddings in bulk or generate them during inference.
To generate ESM-1b embeddings in bulk, use the provided script, `scripts/precompute_embeddings.py`. The script takes a directory of FASTA files (one sequence per file) and generates ESM-1b embeddings in the format and directory structure required by SoloSeq. An example command follows:
...
@@ -335,7 +365,7 @@ SoloSeq allows you to use the same flags and optimizations as the MSA-based Open
**NOTE:** Due to the nature of the ESM-1b embeddings, the sequence length for inference using the SoloSeq model is limited to 1022 residues. Sequences longer than that will be truncated.
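To catch truncation before running inference, the length limit above can be checked per FASTA file. The following is a sketch only: the helper names `fasta_residue_count` and `check_soloseq_length` are hypothetical, not part of OpenFold, and the code assumes one sequence per file, as the embedding script expects.

```bash
# Limit stated in the note above.
SOLOSEQ_MAX_LEN=1022

# Hypothetical helper: count residues in a single-sequence FASTA file
# by dropping header lines and all whitespace.
fasta_residue_count() {
  grep -v '^>' "$1" | tr -d '[:space:]' | wc -c | tr -d ' '
}

# Hypothetical helper: report whether a file fits within the SoloSeq limit.
# Returns non-zero when the sequence would be truncated.
check_soloseq_length() {
  local n
  n="$(fasta_residue_count "$1")"
  if [ "$n" -gt "$SOLOSEQ_MAX_LEN" ]; then
    echo "$1: $n residues, exceeds $SOLOSEQ_MAX_LEN and would be truncated"
    return 1
  fi
  echo "$1: $n residues, OK"
}
```

For example, `for f in fasta_dir/*.fasta; do check_soloseq_length "$f"; done` would list the status of every file in a (hypothetical) `fasta_dir` directory.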
### Training
## Training
To train the model, you will first need to precompute protein alignments.
...
@@ -473,9 +503,9 @@ environment. These run components of AlphaFold and OpenFold side by side and
ensure that output activations are adequately similar. For most modules, we
target a maximum pointwise difference of `1e-4`.
## Building and using the docker container
## Building and Using the Docker Container
### Building the docker image
**Building the Docker Image**
OpenFold can be built as a Docker container using the included Dockerfile. To build it, run the following command from the root of this repository:
...
@@ -483,7 +513,7 @@ Openfold can be built as a docker container using the included dockerfile. To bu
docker build -t openfold .
```
### Running the docker container
**Running the Docker Container**
The built container contains both `run_pretrained_openfold.py` and `train_openfold.py` as well as all necessary software dependencies. It does not contain the model parameters or the sequence and structural databases. These should be downloaded to the host machine following the instructions in the Usage section above.
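Because the parameters and databases stay on the host, they have to be mounted into the container at run time. The sketch below builds such a command and prints it. It assumes the downloads live under `./data` on the host and uses `/data` as a hypothetical mount point inside the container. `HOST_DATA_DIR` and `docker_cmd` are illustrative names, and the `ls /data` check assumes the image defines no entrypoint that would intercept it.

```bash
# Assumption: databases and parameters were downloaded to ./data on the host.
HOST_DATA_DIR="${HOST_DATA_DIR:-$PWD/data}"

# Bind-mount the host directory read-only at /data (hypothetical mount point)
# and list it from inside the `openfold` image built above.
docker_cmd=(docker run --rm -v "${HOST_DATA_DIR}:/data:ro" openfold ls /data)

# Print the command for inspection; run it with: "${docker_cmd[@]}"
printf '%q ' "${docker_cmd[@]}"; echo
```

A read-only mount suits the databases and parameters. Inference outputs would need a separate writable mount.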