Commit 4873c028 authored by jnwei, committed by Jennifer Wei

Rough draft dump of docs and readthedocs build

parent 26f8761b
version: 2
# Set the OS, Python version and other tools you might need
build:
  os: ubuntu-22.04
  tools:
    python: "3.9"
# Build documentation in the "docs/" directory with Sphinx
sphinx:
  configuration: docs/conf.py
conda:
  environment: docs/environment.yml
name: openfold-docs
channels:
  - conda-forge
dependencies:
  - sphinx=7
  - pip:
      - myst-parser==3.*
      - furo
# Auxiliary Sequence Files for OpenFold Training
The training dataset of OpenFold is very large. The `pdb` directory alone contains 185,000 mmCIF files, and each chain has multiple sequence alignment (MSA) files and mmCIF files.
OpenFold introduces a few new file structures for faster access to alignments and mmCIF data.
This documentation explains the benefits of the condensed file structure and describes the contents of each file.
## Default alignment file structure
One way to store mmCIF and alignment files would be to have a directory for each mmCIF chain.
As a case study, consider two proteins:
```
- OpenProteinSet
    └── mmcifs
        └── 3lrm.cif
        └── 6kwc.cif
        ...
```
In the `alignments` directory, [PDB:6KWC](https://www.rcsb.org/structure/6KWC) is a monomer with one chain and thus has one alignment directory. [PDB:3LRM](https://www.rcsb.org/structure/3lrm), a homotetramer, has one alignment directory for each of its four chains.
```
- OpenProteinSet
    └── alignments
        └── 3lrm_A
            ├── bfd_uniclust_hits.a3m
            ├── mgnify_hits.a3m
            ├── pdb70_hits.hhr
            └── uniref90_hits.a3m
        └── 3lrm_B
            ├── bfd_uniclust_hits.a3m
            ├── mgnify_hits.a3m
            ├── pdb70_hits.hhr
            └── uniref90_hits.a3m
        └── 3lrm_C
            ├── bfd_uniclust_hits.a3m
            ├── mgnify_hits.a3m
            ├── pdb70_hits.hhr
            └── uniref90_hits.a3m
        └── 3lrm_D
            ├── bfd_uniclust_hits.a3m
            ├── mgnify_hits.a3m
            ├── pdb70_hits.hhr
            └── uniref90_hits.a3m
        └── 6kwc_A
            ├── bfd_uniclust_hits.a3m
            ├── mgnify_hits.a3m
            ├── pdb70_hits.hhr
            └── uniref90_hits.a3m
        ...
```
In practice, the I/O overhead of having one directory per protein chain makes accessing the alignments slow.
## OpenFold DB file structure
Here we describe a new file structure that OpenFold can use for more efficient access to alignment and index file contents.
Altogether, the directory would look like this:
```
- OpenProteinSet
    ├── duplicate_pdb_chains.txt
    └── pdb
        ├── mmcif_cache.json
        └── mmcifs
            └── 3lrm.cif
            └── 6kwc.cif
    └── alignment_db
        └── alignment_db_0.db
        └── alignment_db_1.db
        ...
        └── alignment_db_9.db
        └── alignment_db.index
```
We will describe each of the file types here.
### Alignment db files and index files
To speed up access to MSAs, OpenFold has an alternate alignment storage procedure. Instead of storing dedicated files for each single alignment, we consolidate large sets of alignments into single files referred to as _alignment_dbs_. This can reduce I/O overhead, and in practice we recommend using around 10 `alignment_db_x.db` files to store the total training set of OpenFold. During training, OpenFold can access each alignment using byte index pointers that are stored in a separate index file (`alignment_db.index`). The alignments for the `3LRM` and `6KWC` examples would be recorded in the index file as follows:
```alignment_db.index
{
...
"3lrm_A": {
"db": "alignment_db_0.db",
"files": [
["bfd_uniclust_hits.a3m", 212896478938, 1680200],
["mgnify_hits.a3m", 212893696883, 2782055],
["pdb70_hits.hhr", 212898159138, 614978],
["uniref90_hits.a3m", 212898774116, 6165789]
]
},
"6kwc_A": {
"db": "alignment_db_1.db",
"files": [
["bfd_uniclust_hits.a3m", 415618723280, 380289],
["mgnify_hits.a3m", 415618556077, 167203],
["pdb70_hits.hhr", 415619103569, 148672],
["uniref90_hits.a3m", 415617547852, 1008225]
]
}
...
}
```
For each entry, the index gives the corresponding `alignment_db` file along with the byte start location and the number of bytes to read for each alignment file. For example, the alignment information in `bfd_uniclust_hits.a3m` for chain `3lrm_A` can be found in the database file `alignment_db_0.db`, starting at byte location `212896478938` and reading the next `1680200` bytes.
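As a minimal illustration of how these index entries can be used (OpenFold's data pipeline handles this internally; the chain and file names below are taken from the example above), an alignment can be read directly from a database shard with a seek and a fixed-length read:

```python
import json

# Load the index that maps each chain to its shard and byte offsets.
with open("alignment_db.index") as f:
    index = json.load(f)

entry = index["3lrm_A"]
with open(entry["db"], "rb") as db:       # e.g. alignment_db_0.db
    for filename, start, length in entry["files"]:
        db.seek(start)                    # jump to the byte offset
        data = db.read(length).decode()   # read exactly `length` bytes
        print(filename, len(data))
```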
### Chain cache files and mmCIF cache files
Information from the mmcif files can be parsed in advance to create a `chain_cache.json` or a `mmcif_cache.json`. For OpenFold, the `chain_cache.json` is used to sample chains for training, and the `mmcif_cache.json` is used to prefilter templates.
Here is what the `chain_cache.json` entries look like for our examples:
```chain_cache.json
{
...
"3lrm_A": {
"release_date": "2010-06-30",
"seq": "MFAFYFLTACISLKGVFGVSPSYNGLGLTPQMGWDNWNTFACDVSEQLLLDTADRISDLGLKDMGYKYIILDDCWSSGRDSDGFLVADEQKFPNGMGHVADHLHNNSFLFGMYSSAGEYTCAGYPGSLGREEEDAQFFANNRVDYLKYANCYNKGQFGTPEISYHRYKAMSDALNKTGRPVFYSLCNWGQDLTFYWGSGIANSWRMSGDVTAEFTRPDSRCPCDGDEYDCKYAGFHCSIMNILNKAAPMGQNAGVGGWNDLDNLEVGVGNLTDDEEKAHFSMWAMVKSPLIIGANVNNLKASSYSIYSQASVIAINQDSNGIPATRVWRYYVSDTDEYGQGEIQMWSGPLDNGDQVVALLNGGSVSRPMNTTLEEIFFDSNLGSKKLTSTWDIYDLWANRVDNSTASAILGRNKTATGILYNATEQSYKDGLSKNDTRLFGQKIGSLSPNAILNTTVPAHGIAFYRLRPSSDYKDDDDK",
"resolution": 2.7,
"cluster_size": 6
},
"3lrm_B": {
"release_date": "2010-06-30",
"seq": "MFAFYFLTACISLKGVFGVSPSYNGLGLTPQMGWDNWNTFACDVSEQLLLDTADRISDLGLKDMGYKYIILDDCWSSGRDSDGFLVADEQKFPNGMGHVADHLHNNSFLFGMYSSAGEYTCAGYPGSLGREEEDAQFFANNRVDYLKYANCYNKGQFGTPEISYHRYKAMSDALNKTGRPVFYSLCNWGQDLTFYWGSGIANSWRMSGDVTAEFTRPDSRCPCDGDEYDCKYAGFHCSIMNILNKAAPMGQNAGVGGWNDLDNLEVGVGNLTDDEEKAHFSMWAMVKSPLIIGANVNNLKASSYSIYSQASVIAINQDSNGIPATRVWRYYVSDTDEYGQGEIQMWSGPLDNGDQVVALLNGGSVSRPMNTTLEEIFFDSNLGSKKLTSTWDIYDLWANRVDNSTASAILGRNKTATGILYNATEQSYKDGLSKNDTRLFGQKIGSLSPNAILNTTVPAHGIAFYRLRPSSDYKDDDDK",
"resolution": 2.7,
"cluster_size": 6
},
"3lrm_C": {
"release_date": "2010-06-30",
"seq": "MFAFYFLTACISLKGVFGVSPSYNGLGLTPQMGWDNWNTFACDVSEQLLLDTADRISDLGLKDMGYKYIILDDCWSSGRDSDGFLVADEQKFPNGMGHVADHLHNNSFLFGMYSSAGEYTCAGYPGSLGREEEDAQFFANNRVDYLKYANCYNKGQFGTPEISYHRYKAMSDALNKTGRPVFYSLCNWGQDLTFYWGSGIANSWRMSGDVTAEFTRPDSRCPCDGDEYDCKYAGFHCSIMNILNKAAPMGQNAGVGGWNDLDNLEVGVGNLTDDEEKAHFSMWAMVKSPLIIGANVNNLKASSYSIYSQASVIAINQDSNGIPATRVWRYYVSDTDEYGQGEIQMWSGPLDNGDQVVALLNGGSVSRPMNTTLEEIFFDSNLGSKKLTSTWDIYDLWANRVDNSTASAILGRNKTATGILYNATEQSYKDGLSKNDTRLFGQKIGSLSPNAILNTTVPAHGIAFYRLRPSSDYKDDDDK",
"resolution": 2.7,
"cluster_size": 6
},
"3lrm_D": {
"release_date": "2010-06-30",
"seq": "MFAFYFLTACISLKGVFGVSPSYNGLGLTPQMGWDNWNTFACDVSEQLLLDTADRISDLGLKDMGYKYIILDDCWSSGRDSDGFLVADEQKFPNGMGHVADHLHNNSFLFGMYSSAGEYTCAGYPGSLGREEEDAQFFANNRVDYLKYANCYNKGQFGTPEISYHRYKAMSDALNKTGRPVFYSLCNWGQDLTFYWGSGIANSWRMSGDVTAEFTRPDSRCPCDGDEYDCKYAGFHCSIMNILNKAAPMGQNAGVGGWNDLDNLEVGVGNLTDDEEKAHFSMWAMVKSPLIIGANVNNLKASSYSIYSQASVIAINQDSNGIPATRVWRYYVSDTDEYGQGEIQMWSGPLDNGDQVVALLNGGSVSRPMNTTLEEIFFDSNLGSKKLTSTWDIYDLWANRVDNSTASAILGRNKTATGILYNATEQSYKDGLSKNDTRLFGQKIGSLSPNAILNTTVPAHGIAFYRLRPSSDYKDDDDK",
"resolution": 2.7,
"cluster_size": 6
},
"6kwc_A": {
"release_date": "2021-01-27",
"seq": "GSTIQPGTGYNNGYFYSYWNDGHGGVTYTNGPGGQFSVNWSNSGEFVGGKGWQPGTKNKVINFSGSYNPNGNSYLSVYGWSRNPLIEYYIVENFGTYNPSTGATKLGEVTSDGSVYDIYRTQRVNQPSIIGTATFYQYWSVRRNHRSSGSVNTANHFNAWAQQGLTLGTMDYQIVAVQGYFSSGSASITVS",
"resolution": 1.297,
"cluster_size": 195
},
...
}
```
The `mmcif_cache.json` file contains similar information, but condensed by mmCIF ID, e.g.
```mmcif_cache.json
{
"3lrm": {
"release_date": "2010-06-30",
"chain_ids": [
"A",
"B",
"C",
"D"
],
"seqs": [
"MFAFYFLTACISLKGVFGVSPSYNGLGLTPQMGWDNWNTFACDVSEQLLLDTADRISDLGLKDMGYKYIILDDCWSSGRDSDGFLVADEQKFPNGMGHVADHLHNNSFLFGMYSSAGEYTCAGYPGSLGREEEDAQFFANNRVDYLKYANCYNKGQFGTPEISYHRYKAMSDALNKTGRPVFYSLCNWGQDLTFYWGSGIANSWRMSGDVTAEFTRPDSRCPCDGDEYDCKYAGFHCSIMNILNKAAPMGQNAGVGGWNDLDNLEVGVGNLTDDEEKAHFSMWAMVKSPLIIGANVNNLKASSYSIYSQASVIAINQDSNGIPATRVWRYYVSDTDEYGQGEIQMWSGPLDNGDQVVALLNGGSVSRPMNTTLEEIFFDSNLGSKKLTSTWDIYDLWANRVDNSTASAILGRNKTATGILYNATEQSYKDGLSKNDTRLFGQKIGSLSPNAILNTTVPAHGIAFYRLRPSSDYKDDDDK",
"MFAFYFLTACISLKGVFGVSPSYNGLGLTPQMGWDNWNTFACDVSEQLLLDTADRISDLGLKDMGYKYIILDDCWSSGRDSDGFLVADEQKFPNGMGHVADHLHNNSFLFGMYSSAGEYTCAGYPGSLGREEEDAQFFANNRVDYLKYANCYNKGQFGTPEISYHRYKAMSDALNKTGRPVFYSLCNWGQDLTFYWGSGIANSWRMSGDVTAEFTRPDSRCPCDGDEYDCKYAGFHCSIMNILNKAAPMGQNAGVGGWNDLDNLEVGVGNLTDDEEKAHFSMWAMVKSPLIIGANVNNLKASSYSIYSQASVIAINQDSNGIPATRVWRYYVSDTDEYGQGEIQMWSGPLDNGDQVVALLNGGSVSRPMNTTLEEIFFDSNLGSKKLTSTWDIYDLWANRVDNSTASAILGRNKTATGILYNATEQSYKDGLSKNDTRLFGQKIGSLSPNAILNTTVPAHGIAFYRLRPSSDYKDDDDK",
"MFAFYFLTACISLKGVFGVSPSYNGLGLTPQMGWDNWNTFACDVSEQLLLDTADRISDLGLKDMGYKYIILDDCWSSGRDSDGFLVADEQKFPNGMGHVADHLHNNSFLFGMYSSAGEYTCAGYPGSLGREEEDAQFFANNRVDYLKYANCYNKGQFGTPEISYHRYKAMSDALNKTGRPVFYSLCNWGQDLTFYWGSGIANSWRMSGDVTAEFTRPDSRCPCDGDEYDCKYAGFHCSIMNILNKAAPMGQNAGVGGWNDLDNLEVGVGNLTDDEEKAHFSMWAMVKSPLIIGANVNNLKASSYSIYSQASVIAINQDSNGIPATRVWRYYVSDTDEYGQGEIQMWSGPLDNGDQVVALLNGGSVSRPMNTTLEEIFFDSNLGSKKLTSTWDIYDLWANRVDNSTASAILGRNKTATGILYNATEQSYKDGLSKNDTRLFGQKIGSLSPNAILNTTVPAHGIAFYRLRPSSDYKDDDDK",
"MFAFYFLTACISLKGVFGVSPSYNGLGLTPQMGWDNWNTFACDVSEQLLLDTADRISDLGLKDMGYKYIILDDCWSSGRDSDGFLVADEQKFPNGMGHVADHLHNNSFLFGMYSSAGEYTCAGYPGSLGREEEDAQFFANNRVDYLKYANCYNKGQFGTPEISYHRYKAMSDALNKTGRPVFYSLCNWGQDLTFYWGSGIANSWRMSGDVTAEFTRPDSRCPCDGDEYDCKYAGFHCSIMNILNKAAPMGQNAGVGGWNDLDNLEVGVGNLTDDEEKAHFSMWAMVKSPLIIGANVNNLKASSYSIYSQASVIAINQDSNGIPATRVWRYYVSDTDEYGQGEIQMWSGPLDNGDQVVALLNGGSVSRPMNTTLEEIFFDSNLGSKKLTSTWDIYDLWANRVDNSTASAILGRNKTATGILYNATEQSYKDGLSKNDTRLFGQKIGSLSPNAILNTTVPAHGIAFYRLRPSSDYKDDDDK"
],
"no_chains": 4,
"resolution": 2.7
},
"6kwc": {
"release_date": "2021-01-27",
"chain_ids": [
"A"
],
"seqs": [
"GSTIQPGTGYNNGYFYSYWNDGHGGVTYTNGPGGQFSVNWSNSGEFVGGKGWQPGTKNKVINFSGSYNPNGNSYLSVYGWSRNPLIEYYIVENFGTYNPSTGATKLGEVTSDGSVYDIYRTQRVNQPSIIGTATFYQYWSVRRNHRSSGSVNTANHFNAWAQQGLTLGTMDYQIVAVQGYFSSGSASITVS"
],
"no_chains": 1,
"resolution": 1.297
},
...
}
```
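To illustrate how this metadata supports template prefiltering (the actual filtering is performed by OpenFold's data pipeline; the cutoff date below is only an example), one could keep just the entries released before a maximum template date:

```python
import json
from datetime import date

MAX_TEMPLATE_DATE = date(2021, 10, 10)  # example cutoff, cf. --max_template_date

with open("mmcif_cache.json") as f:
    mmcif_cache = json.load(f)

allowed_templates = [
    pdb_id
    for pdb_id, meta in mmcif_cache.items()
    if date.fromisoformat(meta["release_date"]) <= MAX_TEMPLATE_DATE
]
```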
### Duplicate pdb chain files
Duplicate chains occur across PDB entries. Some of these chains are the homomeric units of a multimer; others are subunits that are shared across different proteins.
To reduce the overhead of creating and storing identical data for duplicate entries, we provide a duplicate chain file. Each line lists all chains that are identical. Our `6kwc` and `3lrm` examples would be stored as follows:
```duplicate_pdb_chains.txt
...
6kwc_A
3lrm_A 3lrm_B 3lrm_C 3lrm_D
...
```
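The file can be parsed in a few lines; the sketch below assumes, consistent with the example above, that the first chain on each line is the one whose alignment is actually stored:

```python
representative = {}  # maps every chain to the chain whose alignment is stored
with open("duplicate_pdb_chains.txt") as f:
    for line in f:
        chains = line.split()
        if not chains:
            continue
        for chain in chains:
            representative[chain] = chains[0]

# e.g. representative["3lrm_C"] -> "3lrm_A"
```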
# OpenFold Inference
In this guide, we will cover how to use OpenFold to make structure predictions.
## Background
We currently offer three modes of inference:
- Monomer
- Multimer
- Single Sequence (Soloseq)
This guide focuses on monomer prediction; later sections describe Multimer and Single Sequence prediction.
### Pre-requisites:
- OpenFold Conda Environment. Instructions to create this environment are here [[OpenFold installation]]
- Sequence databases for performing multiple sequence alignments. Instructions here [ TODO add link]
## Running AlphaFold Model Inference
The script `run_pretrained_openfold.py` performs model inference. We will go through the steps of how to use this script.
### Download Model Parameters
For monomer inference, you may use either the model parameters provided by DeepMind or the OpenFold-trained parameters. Both should give similar performance; please see [TODO: link to nature paper] for further reference.
The model parameters provided by DeepMind can be downloaded with the following script located in this repository's `scripts/` directory:
```
$ bash scripts/download_alphafold_params.sh ${PARAMS_DIR}
```
To use the OpenFold trained parameters, you can use the following script
```
$ bash scripts/download_openfold_params.sh ${PARAMS_DIR}
```
We recommend selecting `openfold/resources` as the params directory, as this is the default directory used by `run_pretrained_openfold.py` to locate parameters.
If you choose to use a different directory, you may make a symlink to the `openfold/resources` directory, or specify an alternate parameter path with the command line argument `--jax_param_path` for AlphaFold parameters or `--openfold_checkpoint_path` for OpenFold parameters.
### Model Inference
The input to `run_pretrained_openfold.py` is a directory of FASTA files. AlphaFold-style models also require a sequence alignment to perform inference.
If you do not have sequence alignments for your input sequences, you can compute them using the inference script directly by following the instructions in [[Inference#Model inference without pre-computed alignments|Model inference without pre-computed alignments]]
Otherwise, if you already have alignments for your input FASTA sequences, skip ahead to [[Inference#Model inference with pre-computed alignments|Model inference with pre-computed alignments]].
#### Model inference without pre-computed alignments
The following command computes sequence alignments against the provided databases and then performs model inference.
```
python3 run_pretrained_openfold.py \
${INPUT_FASTA_DIR} \
${TEMPLATE_MMCIF_DIR} \
--output_dir ${OUTPUT_DIR} \
--config_preset model_1_ptm \
--uniref90_database_path ${BASE_DATA_DIR}/uniref90 \
--mgnify_database_path ${BASE_DATA_DIR}/mgnify/mgy_clusters_2018_12.fa \
--pdb70_database_path ${BASE_DATA_DIR}/pdb70 \
--uniclust30_database_path ${BASE_DATA_DIR}/uniclust30/uniclust30_2018_08/uniclust30_2018_08 \
--bfd_database_path ${BASE_DATA_DIR}/bfd/bfd_metaclust_clu_complete_id30_c90_final_seq.sorted_opt \
--model_device "cuda:0"
```
**Required arguments:**
- `--output_dir`: specify the output directory
- `${INPUT_FASTA_DIR}`: Directory of query fasta files, one sequence per file. An example input file is provided under `examples/monomer_inference`
- `${TEMPLATE_MMCIF_DIR}`: mmCIF files to use for template matching. This directory is required even when running template-free inference.
- `*_database_path`: Paths to sequence databases for sequence alignments. Instructions on how to download the sequence databases (Uniref90, Mgnify, PDB70, Uniclust, BFD) are provided in [[OpenFold Dataset Download Instructions]].
- `--model_device`: Specify the device to run inference on, e.g. a GPU if one is available.
#### Model inference with pre-computed alignments
To perform model inference with pre-computed alignments, use the following command
```
python3 run_pretrained_openfold.py ${INPUT_FASTA_DIR} \
${TEMPLATE_MMCIF_DIR} \
--output_dir ${OUTPUT_DIR} \
--use_precomputed_alignments ${PRECOMPUTED_ALIGNMENTS} \
--config_preset model_1_ptm \
--model_device "cuda:0"
```
where `${PRECOMPUTED_ALIGNMENTS}` is a directory that contains alignments. A sample alignments directory structure for a single query is:
```
alignments
└── gfp
    ├── bfd_uniclust_hits.a3m
    ├── hhsearch_output.hhr
    ├── mgnify_hits.sto
    └── uniref90_hits.sto
```
Each query directory contains the raw alignments used to build the model inputs: `uniref90_hits.sto` and `mgnify_hits.sto` are MSAs produced by jackhmmer searches against UniRef90 and MGnify, `bfd_uniclust_hits.a3m` is an MSA produced by an HHblits search against BFD and Uniclust30, and `hhsearch_output.hhr` (named `pdb70_hits.hhr` in some alignment sets) contains template hits from an HHsearch run against PDB70. The `.hhr` template hits are only used by the template-based presets; the MSA files are used by all AlphaFold-style models.
#### Configuration settings for template modeling / pTM scoring
There are a few configuration settings available for template based and template-free modeling, and for the option to estimate a predicted template modeling score (pTM).
This table provides guidance on which setting to use for each set of predictions, as well as the parameters to select for each preset.
| Setting | `config_preset` | AlphaFold params (match config name) | OpenFold params (any are allowed) |
| -------------------------: | ----------------------------------------: | :-------------------------------------------------------------------------------- | :--------------------------------- |
| With template, no ptm | model_1<br>model_2 | `params_model_1.npz`<br>`params_model_2.npz` | `finetuning_[2-5].pt` |
| With template, with ptm | model_1_ptm<br>model_2_ptm | `params_model_1_ptm.npz`<br>`params_model_2_ptm.npz` | `finetuning_ptm_[1-2].pt` |
| Without template, no ptm | model_3<br>model_4<br>model_5 | `params_model_3.npz`<br>`params_model_4.npz`<br>`params_model_5.npz` | `finetuning_no_templ_[1-2].pt` |
| Without template, with ptm | model_3_ptm<br>model_4_ptm<br>model_5_ptm | `params_model_3_ptm.npz`<br>`params_model_4_ptm.npz`<br>`params_model_5_ptm.npz` | `finetuning_no_templ_ptm_[1-2].pt` |
If you use AlphaFold parameters and they are located in the default parameter directory (e.g. `openfold/resources`), the parameters matching the `--config_preset` will be selected automatically.
The full set of configurations available for all 5 AlphaFold model presets can be viewed in [`config.py`](https://github.com/aqlaboratory/openfold/blob/main/openfold/config.py#L105). The [[OpenFold Parameters]] page contains more information about the individual OpenFold parameter files.
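If you want to inspect a preset programmatically, its configuration can also be loaded from Python. The sketch below assumes an environment where the `openfold` package is importable; `model_config` is the helper the inference script uses to resolve `--config_preset` names:

```python
from openfold.config import model_config

# Load the configuration associated with a preset name.
config = model_config("model_1_ptm")

# Top-level sections include data, globals, model, and loss.
print(list(config.keys()))

# Individual values use dotted attribute access, e.g. the inference chunk size
# discussed under "Advanced Options" below.
print(config.globals.chunk_size)
```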
#### Model outputs
The expected output contents are as follows:
- `alignments`: Directory of alignments. One directory is made per query sequence, and each directory contains alignments against each of the databases used.
- `predictions`: PDB files for predicted structures
- `timings.json`: JSON file with timings for inference and, if performed, relaxation
### Optional Flags
Some commonly used command line flags are listed here. A full list of flags can be viewed from the `--help` menu.
- `--config_preset`: Specify a different model configuration. There are 5 available model preset settings, some of which support template modeling while others support template-free modeling. The default is `model_1`. More details can be found below in the [[Inference#Template-free modeling]] section.
- `--hmmsearch_binary_path`, `--hmmbuild_binary_path`, etc.: HMMER, HH-suite, and Kalign are required to run alignments. `run_pretrained_openfold.py` will search for these packages in the `bin/` directory of your conda environment. If needed, you can specify a different binary directory with these arguments.
- `--openfold_checkpoint_path`: Use a specific checkpoint or parameter file. Expected types are DeepSpeed checkpoint files or `.pt` files. Make sure your selected checkpoint file matches the configuration setting chosen in `--config_preset`.
- `--data_random_seed`: Specifies a random seed to use.
- `--save_outputs`: Saves a copy of all outputs from the model, e.g. the output of the MSA track and pTM heads.
- `--experiment_config_json`: Specify configuration settings using a JSON file. For example, passing a JSON with `{"globals.relax.max_iterations": 10}` specifies 10 as the maximum number of relaxation iterations. See [`openfold/config.py`](https://github.com/aqlaboratory/openfold/blob/main/openfold/config.py#L283) for the full dictionary of configuration settings. Any parameters that are not manually set in these configuration settings will fall back to the defaults specified by your `config_preset`.
### Advanced Options for Increasing Efficiency
#### Speeding up inference
The **DeepSpeed DS4Sci_EvoformerAttention kernel** is a memory-efficient attention kernel developed as part of a collaboration between OpenFold and the DeepSpeed4Science initiative.
If your system supports DeepSpeed, using it generally leads to an inference speedup of 2-3x without significant additional memory use. You may enable this option with the `--use_deepspeed_inference` argument.
If DeepSpeed is unavailable for your system, you may also try using [FlashAttention](https://github.com/HazyResearch/flash-attention) by adding `"globals.use_flash": true` to the `--experiment_config_json`. Note that FlashAttention appears to work best for sequences with < 1000 residues.
#### Large-scale batch inference
For large-scale batch inference, we offer an optional tracing mode, which massively improves runtimes at the cost of a lengthy model compilation process. To enable it, add `--trace_model` to the inference command.
#### Configuring the inference chunk size
Note that chunking (as defined in section 1.11.8 of the AlphaFold 2 supplement) is enabled by default in inference mode. To disable it, set `globals.chunk_size` to `None` in the config. If a value is specified, OpenFold will attempt to dynamically tune it, considering the chunk size specified in the config as a minimum. This tuning process automatically ensures consistently fast runtimes regardless of input sequence length, but it also introduces some runtime variability, which may be undesirable for certain users. It is also recommended to disable this feature for very long chains (see below). To do so, set the `tune_chunk_size` option in the config to `False`.
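For example, chunking could be turned off through the `--experiment_config_json` mechanism described earlier. The helper below is hypothetical (the file name is arbitrary, and the dotted key follows the notation used elsewhere in this guide):

```python
import json

# Setting "globals.chunk_size" to None disables chunking, as described above.
overrides = {"globals.chunk_size": None}

with open("chunk_overrides.json", "w") as f:  # pass via --experiment_config_json
    json.dump(overrides, f)
```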
#### Long sequence inference
To minimize memory usage during inference on long sequences, consider the following changes:
- As noted in the AlphaFold-Multimer paper, the AlphaFold/OpenFold template stack is a major memory bottleneck for inference on long sequences. OpenFold supports two mutually exclusive inference modes to address this issue. One, `average_templates` in the `template` section of the config, is similar to the solution offered by AlphaFold-Multimer, which is simply to average individual template representations. Our version is modified slightly to accommodate weights trained using the standard template algorithm. Using said weights, we notice no significant difference in performance between our averaged template embeddings and the standard ones. The second, `offload_templates`, temporarily offloads individual template embeddings into CPU memory. The former is an approximation while the latter is slightly slower; both are memory-efficient and allow the model to utilize arbitrarily many templates across sequence lengths. Both are disabled by default, and it is up to the user to determine which best suits their needs, if either.
- Inference-time low-memory attention (LMA) can be enabled in the model config. This setting trades off speed for vastly improved memory usage. By default, LMA is run with query and key chunk sizes of 1024 and 4096, respectively. These represent a favorable tradeoff in most memory-constrained cases. Power users can choose to tweak these settings in `openfold/model/primitives.py`. For more information on the LMA algorithm, see the Staats & Rabe preprint, "Self-attention Does Not Need O(n²) Memory".
- Disable `tune_chunk_size` for long sequences. Past a certain point, it only wastes time.
- As a last resort, consider enabling `offload_inference`. This enables more extensive CPU offloading at various bottlenecks throughout the model.
- Disable FlashAttention, which seems unstable on long sequences.
Using the most conservative settings, we were able to run inference on a 4600-residue complex with a single A100. Compared to AlphaFold's own memory offloading mode, ours is considerably faster; the same complex takes the more efficient AlphaFold-Multimer more than double the time. Use the `long_sequence_inference` config option to enable all of these interventions at once. The `run_pretrained_openfold.py` script can enable this config option with the `--long_sequence_inference` command line option.
Input FASTA files containing multiple sequences are treated as complexes. In this case, the inference script runs AlphaFold-Gap, a hack proposed [here](https://twitter.com/minkbaek/status/1417538291709071362?lang=en), using the specified stock AlphaFold/OpenFold parameters (NOT AlphaFold-Multimer).
OpenFold model parameters, v. 06_22.
# Training details:
Trained using OpenFold on 44 A100s using the training schedule from Table 4 in
the AlphaFold supplement. AlphaFold was used as the pre-distillation model.
Training data is hosted publicly in the "OpenFold Training Data" RODA repository.
To improve model diversity, we forked training after the initial training phase
and finetuned an additional branch without templates.
# Parameter files:
Parameter files fall into the following categories:
initial_training.pt:
OpenFold at the end of the initial training phase.
finetuning_x.pt:
Checkpoints in chronological order corresponding to peaks in the
validation LDDT-Ca during the finetuning phase. Roughly evenly spaced
across the 45 finetuning epochs.
NOTE: finetuning_1.pt, which was included in a previous release, has
been deprecated.
finetuning_no_templ_x.pt
Checkpoints in chronological order corresponding to peaks during an
additional finetuning phase also starting from the 'initial_training.pt'
checkpoint but with templates disabled.
finetuning_no_templ_ptm_x.pt
Checkpoints in chronological order corresponding to peaks during the
pTM training phase of the `no_templ` branch. Models in this category
include the pTM module and comprise the most recent of the checkpoints
in said branch.
finetuning_ptm_x.pt:
Checkpoints in chronological order corresponding to peaks in the pTM
training phase of the mainline branch. Models in this category include
the pTM module and comprise the most recent of the checkpoints in said
branch.
Average validation LDDT-Ca scores for each of the checkpoints are listed below.
The validation set contains approximately 180 chains drawn from CAMEO over a
three-month period at the end of 2021.
initial_training: 0.9088
finetuning_2: 0.9061
finetuning_3: 0.9075
finetuning_4: 0.9059
finetuning_5: 0.9054
finetuning_no_templ_1: 0.9014
finetuning_no_templ_2: 0.9032
finetuning_no_templ_ptm_1: 0.9025
finetuning_ptm_1: 0.9075
finetuning_ptm_2: 0.9097
# Setting up the OpenFold PDB training set from RODA
The multiple sequence alignments of OpenProteinSet and mmCIF structure files required to train OpenFold are freely available at the [Registry of Open Data on AWS (RODA)](https://registry.opendata.aws/openfold/). Additionally, OpenFold requires some postprocessing and [auxiliary files](Aux_seq_files.md) for training that need to be generated from the AWS data manually. This documentation is intended to give a full overview of those steps starting from the data download, assuming that the OpenFold codebase has already been set up on your system at the path `$OF_DIR` and the `openfold` environment is activated.
## 1. Downloading alignments and structure files
To fetch all the alignments corresponding to the original PDB training set of OpenFold alongside their mmCIF 3D structures, you can run the following commands:
```bash
mkdir -p alignment_data/alignment_dir_roda
aws s3 cp s3://openfold/pdb/ alignment_data/alignment_dir_roda/ --recursive --no-sign-request
mkdir pdb_data
aws s3 cp s3://openfold/pdb_mmcif.zip pdb_data/ --no-sign-request
aws s3 cp s3://openfold/duplicate_pdb_chains.txt pdb_data/ --no-sign-request
```
The nested alignment directory structure is not yet exactly what OpenFold expects, so you can run the `flatten_roda.sh` script to convert them to the correct format:
```bash
bash $OF_DIR/scripts/flatten_roda.sh alignment_data/alignment_dir_roda alignment_data/
```
Afterwards, the old directory can be safely removed:
```bash
rm -r alignment_data/alignment_dir_roda
```
## 2. Creating alignment DBs (optional)
As further explained in [Auxiliary Sequence Files in OpenFold](Aux_seq_files.md), OpenFold supports an alternate format for storing alignments that can increase training performance in I/O bottlenecked systems. These so-called `alignment_db` files can be generated with the following script:
```bash
python $OF_DIR/scripts/alignment_db_scripts/create_alignment_db_sharded.py \
alignment_data/alignments \
alignment_data/alignment_dbs \
alignment_db \
--n_shards 10 \
--duplicate_chains_file pdb_data/duplicate_pdb_chains.txt
```
We recommend creating 10 total `alignment_db` files (= "shards") for better
filesystem health and fast preprocessing, but note that this script will only run
optimally if the number of CPUs on your machine is at least as big as the number
of shards you are creating.
As an optional check, you can run the following command which should return 634,434:
```bash
grep "files" alignment_data/alignment_dbs/alignment_db.index | wc -l
```
## 3. Adding duplicate chains to alignments
To save space, the OpenProteinSet alignment database is stored without duplicates, meaning that only one representative alignment is stored for all chains with identical sequences in the PDB, and duplicate instances are tracked with a [`duplicate_pdb_chains.txt`](Aux_seq_files.md#duplicate-pdb-chain-files) file. As OpenFold selects chains during training based on the chains in the alignment directory (or `alignment_db`), we need to add those duplicate chains back in to train on the full conformational diversity of chains in the PDB.
If you've followed the optional Step 2, the `.index` file of your `alignment_db` files will have already been adjusted for duplicates and you can proceed to the next step. Otherwise, the standard alignment directory can be expanded to accommodate duplicates by inserting symlinked directories for the duplicate chains that point to their representative alignments:
```bash
python $OF_DIR/scripts/expand_alignment_duplicates.py \
alignment_data/alignments \
pdb_data/duplicate_pdb_chains.txt
```
As an optional check, the following command should return 634,434:
```bash
ls alignment_data/alignments/ | wc -l
```
## 4. Generating cluster-files
The AlphaFold dataloader adjusts the sampling probability of chains by their inverse cluster size, so we need to generate these sequence clusters for our training set.
As a first step, we'll need a `.fasta` file of all sequences in the training set. This can be generated with the following scripts, depending on how you set up your alignment data in the previous steps:
**Use this if you set up the duplicate-expanded alignment directory (faster):**
```bash
python $OF_DIR/scripts/alignment_data_to_fasta.py \
alignment_data/all-seqs.fasta \
--alignment_dir alignment_data/alignments
```
```{note}
These scripts replace `data_dir_to_fasta.py`, which is much slower because it re-parses all mmCIF structure files; instead, they read each sequence from the query line of its MSA files. As a result, the chains in the generated `.fasta` may not exactly mirror the mmCIF files in the PDB directory if MSA generation failed for some structures. In practice this is fine, since OpenFold only trains on structures with alignments available.
```
**Use this if you set up the `alignment_db` files:**
```bash
python $OF_DIR/scripts/alignment_data_to_fasta.py \
alignment_data/all-seqs.fasta \
--alignment_db_index alignment_data/alignment_dbs/alignment_db.index
```
Next, we need to generate a cluster file at 40% sequence identity, which will contain all chains in a particular cluster on the same line. You'll need [MMSeqs2](https://github.com/soedinglab/MMseqs2?tab=readme-ov-file#installation) for this as well, which can be set up either in a conda environment or as a binary.
```bash
python $OF_DIR/scripts/fasta_to_clusterfile.py \
alignment_data/all-seqs.fasta \
alignment_data/all-seqs_clusters-40.txt \
/path/to/mmseqs \
--seq-id 0.4
```
## 5. Generating cache files
As a last step, OpenFold requires [cache files](Aux_seq_files.md#chain-cache-files-and-mmcif-cache-files) with metadata for each chain that are used for choosing templates and samples during training.
The mmCIF-cache is used for filtering templates and can be generated with the following script:
```bash
mkdir pdb_data/data_caches
python $OF_DIR/scripts/generate_mmcif_cache.py \
pdb_data/mmcif_files \
pdb_data/data_caches/mmcif_cache.json \
--no_workers 16
```
The chain-data-cache is used for filtering training samples and adjusting per-chain sampling probabilities and can be generated with the following script:
```bash
python $OF_DIR/scripts/generate_chain_data_cache.py \
pdb_data/mmcif_files \
pdb_data/data_caches/chain_data_cache.json \
--cluster_file alignment_data/all-seqs_clusters-40.txt \
--no_workers 16
```
# Training OpenFold
## Background
This guide covers how to train an OpenFold model. These instructions focus on training a model for predicting monomers, but additional instructions are provided for training a monomer / multimer model.
### Pre-requisites:
This guide requires the following:
- [Installation of OpenFold and dependencies](installation.md) (including the jackhmmer and hhblits dependencies)
- A preprocessed dataset:
  - For this guide, we will use the original OpenFold dataset, which is available on RODA. It can be downloaded with the following command: `./scripts/download_roda_dbs.sh <dst_path>`
  - If you wish to construct your own dataset, [these instructions](OpenFold_Training_Setup.md) provide guidance for preprocessing alignments into an OpenFold format.
- GPUs configured with CUDA. Training OpenFold with CPUs only is not supported.
Expected directory structure:
```
- OpenProteinSet
    └── alignments
        └── 2x7l_M
            └── mgnify_hits.a3m
            └── bfd_uniclust_hits.a3m
            └── uniref90_hits.a3m
            └── pdb70_hits.hhr
        ...
    └── mmcifs
        └── 3u8d.cif
        └── 3lrm.cif
        ...
    └── mmcif_cache.json
    └── chain_data_cache.json
```
The `mmcif_cache.json` and `chain_data_cache.json` files provide metadata for the mmCIF entries and the protein chains in the dataset.
## Training a new OpenFold model
#### Basic command
The basic command to train a new OpenFold model is
```
python3 train_openfold.py $DATA_DIR/mmcifs/ $DATA_DIR/alignments/ template_mmcif_dir/ $OUTPUT_DIR \
--max_template_date 2021-10-10 \
--train_chain_data_cache_path chain_data_cache.json \
--template_release_dates_cache_path mmcif_cache.json \
--config_preset initial_training \
--seed 42 \
--obsolete_pdbs_file_path obsolete.dat \
--num_nodes 1 \
--gpus 4 \
--num_workers 4
```
The required arguments are:
- `mmcif_dir`: mmCIF files for the training set.
- `alignment_dir`: Alignments for the sequences in `mmcif_dir`; see the expected directory structure above.
- `template_mmcif_dir`: Template mmCIF files with structures, which can be the same directory as `mmcif_dir`. The `max_template_date` and `template_release_dates_cache_path` arguments specify which templates are allowed based on a date cutoff.
- `$OUTPUT_DIR` : Where model checkpoint files and other outputs will be saved.
Commonly used flags include:
- `config_preset`: Specifies which selection of hyperparameters should be used for initial model training. Commonly used configs are defined in `openfold/config.py`
- `num_nodes` and `gpus`: Specifies number of nodes and GPUs available to train OpenFold.
- `seed`: Specifies the random seed
- `num_workers`: Number of CPU workers to assign for creating dataset examples
- `obsolete_pdbs_file_path`: Specifies obsolete pdb IDs that should be excluded from training.
- `val_data_dir` and `val_alignment_dir`: Specifies data directory and alignments for validation dataset.
```{note}
Note that `--seed` must be specified to correctly configure training examples on multi-GPU training runs
```
#### Train OpenFold with Different Dataset Configurations
If the [OpenFold alignment database](OpenFold_Training_Setup.md#2-creating-alignment-dbs-optional) setup is used, pass the directory containing the `alignment_db` shards and their `.index` file in place of the per-chain alignment directory; see the `--help` menu of `train_openfold.py` for the corresponding arguments.
#### Additional command line flag options:
Here we provide brief descriptions for customizing your training run of OpenFold. A full description of all flags can be accessed by using the `--help` option in the script
- **Use DeepSpeed acceleration strategy:** `--deepspeed_config`. This option configures OpenFold to use custom DeepSpeed kernels. It requires a `deepspeed_config.json`; you can create your own or use the one provided in the OpenFold repository.
- **Use a validation dataset:** Specify validation database paths with `--val_data_dir` + `--val_alignment_dir`. Validation metrics will be evaluated on these datasets.
- **Use a self-distillation dataset:** Specify paths with `--distillation_data_dir` and `--distillation_alignment_dir` flags
- **Change specific parameters in the model or data setup:** `--experiment_config_json`. These parameters must be defined in [`openfold/config.py`](https://github.com/aqlaboratory/openfold/blob/main/openfold/config.py). For example, to change the crop size for training a model, you can write the following JSON:
```cropsize.json
{
"data.train.crop_size": 128
}
```
- **Configure training settings with PyTorch Lightning**
Some flags, e.g. `--precision` and `--max_epochs`, configure training behavior. See the PyTorch Lightning Trainer args section in the `--help` menu for more information, and consult the [PyTorch Lightning documentation](https://lightning.ai/docs/pytorch/stable/).
- Precision: On A100s, OpenFold training works best with bfloat16 precision (e.g. `--precision bf16-mixed`).
- **Restart training from an existing checkpoint:** Use the `--resume_from_ckpt` flag to restart training from an existing checkpoint.
## Advanced Training Configurations
### Training OpenFold Multimer
At this time, we do not have a multimer training set available. To prepare your own multimer training set, please see the instructions at [Data Processing - multimer]
The basic command for training a multimer model is then:
```
multimer training command here
```
The key differences are:
- Dataset configuration / preparation
### Fine tuning from existing model weights
If you have existing model weights, you can fine tune the model using the following command:
```
python3 train_openfold.py mmcif_dir/ alignment_dir/ template_mmcif_dir/ $OUTPUT_DIR \
--max_template_date 2021-10-10 \
--train_chain_data_cache_path chain_data_cache.json \
--template_release_dates_cache_path mmcif_cache.json \
--config_preset finetuning \
--seed 4242022 \
--obsolete_pdbs_file_path obsolete.dat \
--num_nodes 1 \
--gpus 4 \
--num_workers 4 \
--resume_from_ckpt $CHECKPOINT_PATH \
--resume_model_weights_only
```
If you have model parameters from OpenFold v1.x, you may need to convert your checkpoint or parameter file. See [[Converting OpenFold v1 Weights]].
### Using MPI
If MPI is configured on your system and you would like to use MPI to train OpenFold models, you may do so with the following steps:
1. Add the `mpi4py` package, which is available through pip and conda. Please see the [mpi4py documentation](https://pypi.org/project/mpi4py/) for more instructions on installation.
2. Add the `--mpi_plugin` flag to your training command.
## Troubleshooting FAQ
**My model training is hanging on the data loading step.**
While each system is different, here are a few general suggestions:
- Check your `$KMP_AFFINITY` environment setting.
- Adjust the number of data workers used to prepare data with the `--num_workers` setting. Increasing the number can help with dataset processing speed; however, too many workers can cause an out-of-memory (OOM) issue.
**When I reload my pretrained model weights or checkpoints, I get `RuntimeError: Error(s) in loading state_dict for OpenFoldWrapper: Unexpected key(s) in state_dict:`**
This suggests that your checkpoint / model weights are in the OpenFold v1 format with outdated model layer names. Convert your weights/checkpoints by following [[Converting OpenFold v1 Weights]].
...
*Comparison of OpenFold and AlphaFold2 predictions to the experimental structure of PDB 7KDX, chain B.*
Welcome to the Documentation for OpenFold, the fully open source, trainable, PyTorch-based reproduction of DeepMind's
[AlphaFold 2](https://github.com/deepmind/alphafold).
Get started with OpenFold with our [Setup Guide](installation.md)!
Here, you will find guides for:
- Learn how to [run inference with OpenFold](Inference.md)
- [Train your own OpenFold models](Training_OpenFold.md)
# Features
OpenFold carefully reproduces (almost) all of the features of the original open
...
implementations, respectively.
- **FlashAttention** support greatly speeds up MSA attention.
- **DeepSpeed DS4Sci_EvoformerAttention kernel** is a memory-efficient attention kernel developed as part of a collaboration between OpenFold and the DeepSpeed4Science initiative. The kernel provides substantial speedups for training and inference, and significantly reduces the model's peak device memory requirement by 13X. The model is 15% faster during the initial training and finetuning stages, and up to 4x faster during inference.
# Copyright Notice
While AlphaFold's and, by extension, OpenFold's source code is licensed under
...
If you use OpenProteinSet, please also cite:
Any work that cites OpenFold should also cite [AlphaFold](https://www.nature.com/articles/s41586-021-03819-2) and [AlphaFold-Multimer](https://www.biorxiv.org/content/10.1101/2021.10.04.463034v1) if applicable.
```{toctree}
:hidden:
:caption: Guides
Installation.md
Inference.md
OpenFold_Training_setup.md
Training_OpenFold.md
```
```{toctree}
:hidden:
:caption: Reference
Aux_seq_files.md
OpenFold_Parameters.md
```