_Figure: Comparison of OpenFold and AlphaFold2 predictions to the experimental structure of PDB 7KDX, chain B._
## Overview: before running training script
NB: before running the test codes,please download the procrustes package first:
from https://github.com/theochem/procrustes
# OpenFold
Make sure that ```mmcif_cache.json```, which is the product of running ```scripts/generate_mmcif_cache.py``` is ready and available in ```tests/test_data```
I have uploaded the json file to owncloud [here](https://oc.embl.de/index.php/s/wVUwc1IHiJUt9sP)
A faithful but trainable PyTorch reproduction of DeepMind's
OpenFold carefully reproduces (almost) all of the features of the original open
source inference code (v2.0.1). The sole exception is model ensembling, which
fared poorly in DeepMind's own ablation testing and is being phased out in future
DeepMind experiments. It is omitted here for the sake of reducing clutter. In
cases where the *Nature* paper differs from the source, we always defer to the
latter.
OpenFold is trainable in full precision, half precision, or `bfloat16` with or without DeepSpeed,
and we've trained it from scratch, matching the performance of the original.
We've publicly released model weights and our training data — some 400,000
MSAs and PDB70 template hit files — under a permissive license. Model weights
are available via scripts in this repository while the MSAs are hosted by the
[Registry of Open Data on AWS (RODA)](https://registry.opendata.aws/openfold).
Try out running inference for yourself with our [Colab notebook](https://colab.research.google.com/github/aqlaboratory/openfold/blob/main/notebooks/OpenFold.ipynb).
OpenFold also supports inference using AlphaFold's official parameters, and
vice versa (see `scripts/convert_of_weights_to_jax.py`).
OpenFold has the following advantages over the reference implementation:
-**Faster inference** on GPU, sometimes by as much as 2x. The greatest speedups are achieved on (>= Ampere) GPUs.
-**Inference on extremely long chains**, made possible by our implementation of low-memory attention
([Rabe & Staats 2021](https://arxiv.org/pdf/2112.05682.pdf)). OpenFold can predict the structures of
sequences with more than 4000 residues on a single A100, and even longer ones with CPU offloading.
-**Custom CUDA attention kernels** modified from [FastFold](https://github.com/hpcaitech/FastFold)'s
kernels support in-place attention during inference and training. They use
4x and 5x less GPU memory than equivalent FastFold and stock PyTorch
implementations, respectively.
-**Efficient alignment scripts** using the original AlphaFold HHblits/JackHMMER pipeline or [ColabFold](https://github.com/sokrypton/ColabFold)'s, which uses the faster MMseqs2 instead. We've used them to generate millions of alignments.
-**FlashAttention** support greatly speeds up MSA attention.
## Installation (Linux)
All Python dependencies are specified in `environment.yml`. For producing sequence
alignments, you'll also need `kalign`, the [HH-suite](https://github.com/soedinglab/hh-suite),
and one of {`jackhmmer`, [MMseqs2](https://github.com/soedinglab/mmseqs2)(nightly build)}
installed on on your system. You'll need `git-lfs` to download OpenFold parameters.
Finally, some download scripts require `aria2c` and `aws`.
For convenience, we provide a script that installs Miniconda locally, creates a
`conda` virtual environment, installs all Python dependencies, and downloads
useful resources, including both sets of model parameters. Run:
```bash
scripts/install_third_party_dependencies.sh
```
To activate the environment, run:
```bash
source scripts/activate_conda_env.sh
```
To deactivate it, run:
```bash
source scripts/deactivate_conda_env.sh
```
With the environment active, compile OpenFold's CUDA kernels with
```bash
python3 setup.py install
```
To install the HH-suite to `/usr/bin`, run
```bash
# scripts/install_hh_suite.sh
```
## Usage
If you intend to generate your own alignments, e.g. for inference, you have two
choices for downloading protein databases, depending on whether you want to use
DeepMind's MSA generation pipeline (w/ HMMR & HHblits) or
[ColabFold](https://github.com/sokrypton/ColabFold)'s, which uses the faster
Unlike training monomer, chain_cache_data is not required but the train_mmcifs_cache is required. In this case, I selected these 9 mmcifs that are already in the previous test_data folder as a training set. ```./tests/test_data/train_mmcifs_cache.json``` in the command above record the information of these 9 structures and is needed to run the training code.
As noted before, you can skip the `binary_path` arguments if these binaries are
at `/usr/bin`. Expect this step to take a very long time, even for small
numbers of proteins.
## Issues
Testing the codes on cpu works fine but when running it on a gpu, it causes ```RuntimeError: CUDA error: device-side assert triggered``` at unexpected steps.
Alternatively, you can generate MSAs with the ColabFold pipeline (and templates
with HHsearch) with:
For example, this error was raised while calculating the best rotation matrix that aligns selected anchors during multi-chain permutation steps, I have to use
```torch.masked_select``` and ```torch.index_select``` in https://github.com/dingquanyu/openfold/blob/a1ef4c8fa99da5cff9501051de71be440ca3cedf/openfold/utils/loss.py#L2043 and https://github.com/dingquanyu/openfold/blob/a1ef4c8fa99da5cff9501051de71be440ca3cedf/openfold/utils/loss.py#L2060 instead of simply slicing the matrix like ```matrix[index]```.
where `--template_release_dates_cache_path` is a path to the mmCIF cache.
Note that `template_mmcif_dir` can be the same as `mmcif_dir` which contains
training targets. A suitable DeepSpeed configuration file can be generated with
`scripts/build_deepspeed_config.py`. The training script is
written with [PyTorch Lightning](https://github.com/PyTorchLightning/pytorch-lightning)
and supports the full range of training options that entails, including
multi-node distributed training, validation, and so on. For more information,
consult PyTorch Lightning documentation and the `--help` flag of the training
script.
Note that, despite its variable name, `mmcif_dir` can also contain PDB files
or even ProteinNet .core files.
To emulate the AlphaFold training procedure, which uses a self-distillation set
subject to special preprocessing steps, use the family of `--distillation` flags.
In cases where it may be burdensome to create separate files for each chain's
alignments, alignment directories can be consolidated using the scripts in
`scripts/alignment_db_scripts/`. First, run `create_alignment_db.py` to
consolidate an alignment directory into a pair of database and index files.
Once all alignment directories (or shards of a single alignment directory)
have been compiled, unify the indices with `unify_alignment_db_indices.py`. The
resulting index, `super.index`, can be passed to the training script flags
containing the phrase `alignment_index`. In this scenario, the `alignment_dir`
flags instead represent the directory containing the compiled alignment
databases. Both the training and distillation datasets can be compiled in this
way. Anecdotally, this can speed up training in I/O-bottlenecked environments.
## Testing
To run unit tests, use
```bash
scripts/run_unit_tests.sh
```
The script is a thin wrapper around Python's `unittest` suite, and recognizes
`unittest` arguments. E.g., to run a specific test verbosely:
```bash
scripts/run_unit_tests.sh -v tests.test_model
```
Certain tests require that AlphaFold (v2.0.1) be installed in the same Python
environment. These run components of AlphaFold and OpenFold side by side and
ensure that output activations are adequately similar. For most modules, we
target a maximum pointwise difference of `1e-4`.
## Building and using the docker container
Later on the same ```CUDA error: device-side assert triggered``` error was raised while adding dimensions to the ```atom_pred_positions``` in https://github.com/dingquanyu/openfold/blob/a1ef4c8fa99da5cff9501051de71be440ca3cedf/openfold/utils/loss.py#L989
### Building the docker image
Openfold can be built as a docker container using the included dockerfile. To build it, run the following command from the root of this repository:
```bash
docker build -t openfold .
```
### Running the docker container
The built container contains both `run_pretrained_openfold.py` and `train_openfold.py` as well as all necessary software dependencies. It does not contain the model parameters, sequence, or structural databases. These should be downloaded to the host machine following the instructions in the Usage section above.
The docker container installs all conda components to the base conda environment in `/opt/conda`, and installs openfold itself in `/opt/openfold`,
Before running the docker container, you can verify that your docker installation is able to properly communicate with your GPU by running the following command:
```bash
docker run --rm--gpus all nvidia/cuda:11.0-base nvidia-smi
```
Note the `--gpus all` option passed to `docker run`. This option is necessary in order for the container to use the GPUs on the host machine.
To run the inference code under docker, you can use a command like the one below. In this example, parameters and sequences from the alphafold dataset are being used and are located at `/mnt/alphafold_database` on the host machine, and the input files are located in the current working directory. You can adjust the volume mount locations as needed to reflect the locations of your data.
While AlphaFold's and, by extension, OpenFold's source code is licensed under
the permissive Apache Licence, Version 2.0, DeepMind's pretrained parameters
fall under the CC BY 4.0 license, a copy of which is downloaded to
`openfold/resources/params` by the installation script. Note that the latter
replaces the original, more restrictive CC BY-NC 4.0 license as of January 2022.
## Contributing
If you encounter problems using OpenFold, feel free to create an issue! We also
welcome pull requests from the community.
## Citing this work
Please cite our paper:
```bibtex
@article{Ahdritz2022.11.20.517210,
author={Ahdritz, Gustaf and Bouatta, Nazim and Kadyan, Sachin and Xia, Qinghui and Gerecke, William and O{\textquoteright}Donnell, Timothy J and Berenberg, Daniel and Fisk, Ian and Zanichelli, Niccolò and Zhang, Bo and Nowaczynski, Arkadiusz and Wang, Bei and Stepniewska-Dziubinska, Marta M and Zhang, Shang and Ojewole, Adegoke and Guney, Murat Efe and Biderman, Stella and Watkins, Andrew M and Ra, Stephen and Lorenzo, Pablo Ribalta and Nivon, Lucas and Weitzner, Brian and Ban, Yih-En Andrew and Sorger, Peter K and Mostaque, Emad and Zhang, Zhao and Bonneau, Richard and AlQuraishi, Mohammed},
title={OpenFold: Retraining AlphaFold2 yields new insights into its learning mechanisms and capacity for generalization},
elocation-id={2022.11.20.517210},
year={2022},
doi={10.1101/2022.11.20.517210},
publisher={Cold Spring Harbor Laboratory},
abstract={AlphaFold2 revolutionized structural biology with the ability to predict protein structures with exceptionally high accuracy. Its implementation, however, lacks the code and data required to train new models. These are necessary to (i) tackle new tasks, like protein-ligand complex structure prediction, (ii) investigate the process by which the model learns, which remains poorly understood, and (iii) assess the model{\textquoteright}s generalization capacity to unseen regions of fold space. Here we report OpenFold, a fast, memory-efficient, and trainable implementation of AlphaFold2, and OpenProteinSet, the largest public database of protein multiple sequence alignments. We use OpenProteinSet to train OpenFold from scratch, fully matching the accuracy of AlphaFold2. Having established parity, we assess OpenFold{\textquoteright}s capacity to generalize across fold space by retraining it using carefully designed datasets. We find that OpenFold is remarkably robust at generalizing despite extreme reductions in training set size and diversity, including near-complete elisions of classes of secondary structure elements. By analyzing intermediate structures produced by OpenFold during training, we also gain surprising insights into the manner in which the model learns to fold proteins, discovering that spatial dimensions are learned sequentially. Taken together, our studies demonstrate the power and utility of OpenFold, which we believe will prove to be a crucial new resource for the protein modeling community.},