In this guide, we will cover how to use OpenFold to make structure predictions.
## Background
We currently offer three modes of inference:
- Monomer
- Multimer
- Single Sequence (Soloseq)
This guide will focus on monomer prediction; the following sections describe [Multimer](Multimer_Inference.md) and [Single Sequence](Single_Sequence_Inference.md) prediction.
### Pre-requisites:
## Running AlphaFold Model Inference
The script [`run_pretrained_openfold.py`](https://github.com/aqlaboratory/openfold/blob/main/run_pretrained_openfold.py) performs model inference. We will go through the steps of how to use this script.
### Download Model Parameters
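If you have not downloaded parameters yet, the repository ships helper scripts for this. The script names and destination directory below are assumptions based on the repository layout; adjust the paths to your setup.

```bash
# Download the OpenFold-trained parameters (destination directory is an example)
bash scripts/download_openfold_params.sh openfold/resources

# Optionally, also fetch the original AlphaFold parameters
bash scripts/download_alphafold_params.sh openfold/resources
```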
### Model Inference
The input to [`run_pretrained_openfold.py`](https://github.com/aqlaboratory/openfold/blob/main/run_pretrained_openfold.py) is a directory of FASTA files. AlphaFold-style models also require a sequence alignment to perform inference.
If you do not have sequence alignments for your input sequences, you can compute them with the inference script directly by following the instructions in the section on [inference without pre-computed alignments](#model-inference-without-pre-computed-alignments).
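With pre-computed alignments in hand, a typical invocation looks roughly like the sketch below. The paths, preset, and output directory are placeholders, and the exact flag set varies by version, so consult `python run_pretrained_openfold.py --help` for the full list.

```bash
# fasta_dir/ holds the input FASTA files; the second positional argument
# points at the directory of template mmCIF structures.
python3 run_pretrained_openfold.py \
    fasta_dir/ \
    data/pdb_mmcif/mmcif_files/ \
    --use_precomputed_alignments alignments_dir/ \
    --config_preset model_1_ptm \
    --model_device "cuda:0" \
    --output_dir ./predictions
```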
Multimer inference can also run with the older database versions if desired.
## Running Multimer Inference
The [`run_pretrained_openfold.py`](https://github.com/aqlaboratory/openfold/blob/main/run_pretrained_openfold.py) script can be used to run multimer inference with the following command.
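The exact flags depend on where your genetic databases live. The sketch below uses placeholder paths, assumes the AlphaFold-Multimer database layout (UniProt and PDB seqres), and the flag names may differ in your version, so check the script's `--help` output.

```bash
# Multimer prediction uses a multimer config preset and additionally needs
# the UniProt and PDB seqres databases for MSA pairing and templates.
python3 run_pretrained_openfold.py \
    fasta_dir/ \
    data/pdb_mmcif/mmcif_files/ \
    --config_preset model_1_multimer_v3 \
    --model_device "cuda:0" \
    --uniprot_database_path data/uniprot/uniprot.fasta \
    --pdb_seqres_database_path data/pdb_seqres/pdb_seqres.txt \
    --output_dir ./predictions
```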
# Setting up the OpenFold PDB training set from RODA
The multiple sequence alignments of OpenProteinSet and mmCIF structure files required to train OpenFold are freely available at the [Registry of Open Data on AWS (RODA)](https://registry.opendata.aws/openfold/). Additionally, OpenFold requires some postprocessing and [auxiliary files](Aux_seq_files.md) for training that need to be generated from the AWS data manually. This documentation is intended to give a full overview of those steps starting from the data download.
### Pre-Requisites:
- OpenFold conda environment. See [OpenFold Installation](Installation.md) for instructions on how to build this environment.
- For this guide, we assume that the OpenFold codebase is located at `$OF_DIR`.
## 1. Downloading alignments and structure files
To fetch all the alignments corresponding to the original PDB training set of OpenFold alongside their mmCIF 3D structures, you can run the following commands:
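For example, the alignments can be pulled with the AWS CLI (no AWS account is needed). The bucket prefix below is an assumption; check the [RODA listing](https://registry.opendata.aws/openfold/) for the current layout. The mmCIF structure files can be fetched in the same way from a PDB snapshot mirror.

```bash
# Pull the precomputed PDB-chain alignments from the public OpenFold bucket
# into the directory that the later steps of this guide expect.
mkdir -p alignment_data/alignments
aws s3 cp --no-sign-request --recursive \
    s3://openfold/pdb/ alignment_data/alignments/
```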
As an optional check, the following command should return 634,434:
```bash
ls alignment_data/alignments/ | wc -l
```
## 4. Generating cluster-files
The AlphaFold dataloader adjusts the sampling probability of chains by their inverse cluster size, so we need to generate these sequence clusters for our training set.
As a first step, we'll need a `.fasta` file of all sequences in the training set. This can be generated with the following scripts, depending on how you set up your alignment data in the previous steps:
Note: these scripts replace `data_dir_to_fasta.py`, which is slow because it re-parses every mmCIF structure file. Instead, they read the sequence from the `>query` line of each chain's MSA, which is much faster. As a result, the generated `.fasta` may not exactly mirror the mmCIF files in the PDB directory if MSA generation failed for some structures; since OpenFold only trains on chains with alignments available, this is fine in practice.
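For illustration, if your alignments are stored as raw per-chain directories, a minimal sketch along the following lines can assemble the FASTA. The directory layout and file names are assumptions; use the dedicated scripts below for the `alignment_db` case.

```bash
# Assumed layout: alignment_data/alignments/<pdb_id>_<chain>/<name>.a3m
# The first record of each a3m is the query, so we take its header line and
# the sequence on the next line. If your a3m files carry an extra header
# line, adjust the NR value accordingly.
out=train_set.fasta
: > "$out"
for chain_dir in alignment_data/alignments/*/; do
    chain_id=$(basename "$chain_dir")
    a3m=$(ls "$chain_dir"*.a3m 2>/dev/null | head -n 1)
    [ -z "$a3m" ] && continue
    printf '>%s\n' "$chain_id" >> "$out"
    awk 'NR==2 { print; exit }' "$a3m" >> "$out"
done
```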
**Use this if you set up the `alignment_db` files:**
Next, we need to generate a cluster file at 40% sequence identity, which will contain all chains in a particular cluster on the same line. You'll need [MMSeqs2](https://github.com/soedinglab/MMseqs2?tab=readme-ov-file#installation) for this as well, which can be set up either in a conda environment or as a binary.
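With MMSeqs2 installed, a clustering run at 40% identity and the conversion to a one-cluster-per-line file could look like the sketch below. The output file name and any extra coverage flags needed to match AlphaFold's clustering exactly are assumptions; tune them as required.

```bash
# Cluster all training-set sequences at 40% sequence identity.
# MMSeqs2 writes <prefix>_cluster.tsv with (representative, member) pairs.
mmseqs easy-cluster train_set.fasta clusters tmp/ --min-seq-id 0.4

# Collapse the TSV so that each line lists every chain in one cluster.
awk -F'\t' '{ members[$1] = members[$1] " " $2 }
            END { for (rep in members) print substr(members[rep], 2) }' \
    clusters_cluster.tsv > clusters_by_40_seq_id.txt
```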
As a last step, OpenFold requires ["cache" files](Aux_seq_files.md#chain-cache-files-and-mmcif-cache-files) with metadata information for each chain that are used for choosing templates and samples during training.
The mmCIF-cache is used for filtering templates and can be generated with the following script:
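In the OpenFold repository this is handled by `scripts/generate_mmcif_cache.py`; a typical call is sketched below. The worker count and paths are examples, and the flags may differ slightly between versions.

```bash
# Build a JSON cache of per-entry metadata (release dates, resolution, chains)
# from the mmCIF directory; this is later used for template filtering.
python scripts/generate_mmcif_cache.py \
    pdb_data/mmcif_files/ \
    mmcif_cache.json \
    --no_workers 16
```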
To run inference for a sequence using the SoloSeq single-sequence model, you can either precompute ESM-1b embeddings in bulk, or you can generate them during inference.
For generating ESM-1b embeddings in bulk, use the provided script: [`scripts/precompute_embeddings.py`](https://github.com/aqlaboratory/openfold/blob/main/scripts/precompute_embeddings.py). The script takes a directory of FASTA files (one sequence per file) and generates ESM-1b embeddings in the same format and directory structure as required by SoloSeq. Following is an example command to use the script:
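A call along the following lines should work; the argument order is an assumption based on the script's usage, so check `python scripts/precompute_embeddings.py --help` if it differs in your version.

```bash
# Reads every FASTA file in fasta_dir/ (one sequence per file) and writes the
# corresponding ESM-1b embeddings into embeddings_dir/ in SoloSeq's layout.
python scripts/precompute_embeddings.py \
    fasta_dir/ \
    embeddings_dir/
```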
This guide requires the following:
- [Installation of OpenFold and dependencies](Installation.md) (including the jackhmmer and hhblits dependencies)
- A preprocessed dataset:
  - For this guide, we will use the original OpenFold dataset, which is available on RODA. This dataset can be downloaded with the following command:
    `./scripts/download_roda_dbs.sh <dst_path>`
  - If you wish to construct your own dataset, [these instructions](OpenFold_Training_Setup.md) provide guidance for preprocessing alignments into an OpenFold format.
- GPUs configured with CUDA. Training OpenFold with CPUs only is not supported.
Expected data directory structure:
```
- OpenProteinSet
    └── alignments
```
Adjust the number of data workers used to prepare data with the `--num_workers` setting. Increasing the number can help with dataset processing speed; however, too many workers can cause an out-of-memory (OOM) error.
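For example, a training invocation that raises the worker count might look roughly like the following; the positional arguments and other flags are placeholders and vary by version and dataset layout.

```bash
# The fifth positional argument (2021-10-10 here) is the maximum template
# release date; adjust it to your training cutoff.
python3 train_openfold.py \
    mmcif_dir/ alignment_dir/ template_mmcif_dir/ output_dir/ \
    2021-10-10 \
    --config_preset initial_training \
    --num_workers 8 \
    --seed 42
```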
When I reload my pretrained model weights or checkpoints, I get `RuntimeError: Error(s) in loading state_dict for OpenFoldWrapper: Unexpected key(s) in state_dict:`
This suggests that your checkpoint / model weights are in OpenFold v1 format with outdated model layer names. Convert your weights/checkpoints following [this guide](convert_of_v1_weights.md).
As part of the [OpenFold v2 update](https://github.com/aqlaboratory/openfold/releases/tag/v2.0.0) with the integration of multimer prediction, certain layers of the AlphaFold model were renamed (for example, `module.model.template_angle_embedder.*`).
If you have some checkpoints that were trained using OpenFold v1 or older, and now want to resume training on OpenFold v2, you may need to convert your checkpoints.
## FAQ
### Do I need to convert my checkpoints / model weights?
If you want to run inference or resume training from a checkpoint that was trained with OpenFold V1, you will need to convert your checkpoint.
If you want to load model weights only, without resuming from a specific training step, then you should not need to convert your checkpoints; training will begin from `step=0` in this case. To do so, you'll need both the `--resume_from_ckpt` and `--resume_model_weights_only` flags. This example allows you to train starting from the pre-trained OpenFold weights:
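A sketch of such a command is below. The checkpoint path is a placeholder, and depending on the version the boolean flag may need an explicit `True` value.

```bash
# Load only the model weights from a pre-trained checkpoint and start
# training from step 0 (no optimizer or scheduler state is restored).
python3 train_openfold.py \
    mmcif_dir/ alignment_dir/ template_mmcif_dir/ output_dir/ \
    2021-10-10 \
    --config_preset initial_training \
    --resume_from_ckpt openfold/resources/openfold_params/finetuning_ptm_2.pt \
    --resume_model_weights_only
```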
**Pre-requisites**
This package is currently supported for CUDA 11 and Pytorch 1.12. All dependencies are listed in the [`environment.yml`](https://github.com/aqlaboratory/openfold/blob/main/environment.yml).