_Figure: Comparison of OpenFold and AlphaFold2 predictions to the experimental structure of PDB 7KDX, chain B._
# OpenFold
A faithful but trainable PyTorch reproduction of DeepMind's [AlphaFold 2](https://github.com/deepmind/alphafold).
# Documentation
See our new home for docs at [openfold.readthedocs.io](https://openfold.readthedocs.io/en/latest/), with instructions for installation and model inference/training.
Much of the content from this page may be found [here](https://github.com/aqlaboratory/openfold/blob/main/docs/source/original_readme.md).
## Copyright Notice
While AlphaFold's and, by extension, OpenFold's source code is licensed under the permissive Apache License, Version 2.0, DeepMind's pretrained parameters fall under the more restrictive CC BY 4.0 license.
The multiple sequence alignments of OpenProteinSet and the mmCIF structure files required for training can be downloaded as described below.
### Pre-Requisites:
- OpenFold conda environment. See [OpenFold Installation](Installation.md) for instructions on how to build this environment.
- In particular, the [AWS CLI](https://aws.amazon.com/cli/) is used to download data from RODA.
- For this guide, we assume that the OpenFold codebase is located at `$OF_DIR`.
## 1. Downloading alignments and structure files
To fetch all the alignments corresponding to the original PDB training set of OpenFold alongside their mmCIF 3D structures, you can run the following commands:
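The commands were not reproduced here; a sketch using the AWS CLI might look like the following. The S3 prefixes under the RODA `s3://openfold` bucket are assumptions; verify them against the RODA listing before running.

```shell
# Download OpenProteinSet alignments for the PDB training set
# (bucket prefixes are assumptions; confirm against the RODA listing).
aws s3 cp s3://openfold/pdb "$DATA_DIR/alignment_data/pdb" \
    --recursive --no-sign-request

# Download the corresponding mmCIF structure files.
aws s3 cp s3://openfold/pdb_mmcifs "$DATA_DIR/pdb_data/mmcifs" \
    --recursive --no-sign-request
```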
The nested alignment directory structure is not yet exactly what OpenFold expects, so you can run the `flatten_roda.sh` script to convert them to the correct format:
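An invocation sketch follows; the argument order is an assumption, so check the script source or its usage message first.

```shell
# Flatten the nested RODA alignment directories into the per-chain
# layout OpenFold expects (paths are illustrative).
bash scripts/flatten_roda.sh \
    "$DATA_DIR/alignment_data/pdb" \
    "$DATA_DIR/alignment_data/alignments"
```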
As a last step, OpenFold requires ["cache" files](Aux_seq_files.md#chain-cache-files-and-mmcif-cache-files) with metadata information for each chain that are used for choosing templates and samples during training.
The mmCIF cache is used for filtering templates and can be generated with the following script:
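A sketch of generating the cache with the `scripts/generate_mmcif_cache.py` script from the OpenFold repository (the output path and worker flag are assumptions; check the script's `--help`):

```shell
# Build mmcif_cache.json with per-structure metadata used for
# template filtering (paths and worker count are illustrative).
python3 scripts/generate_mmcif_cache.py \
    "$DATA_DIR/pdb_data/mmcifs" \
    "$DATA_DIR/pdb_data/data_caches/mmcif_cache.json" \
    --no_workers 16
```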
The data caches for OpenProteinSet can be downloaded from RODA with the following:
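The download command was not reproduced here; a sketch, assuming the caches live under a `data_caches` prefix in the RODA bucket (verify the prefix against the RODA listing):

```shell
# Fetch the precomputed caches instead of generating them locally
# (the S3 prefix is an assumption; confirm before running).
aws s3 cp s3://openfold/data_caches "$DATA_DIR/pdb_data/data_caches" \
    --recursive --no-sign-request
```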
This guide covers how to train an OpenFold model for monomers. Additional instructions are provided at the end for training a multimer model and for fine-tuning from existing model weights.
### Pre-requisites:
This guide requires the following:
- [Installation of OpenFold and dependencies](Installation.md), including the jackhmmer and hhblits dependencies
- A preprocessed dataset:
- If you wish to construct your own dataset, [these instructions](OpenFold_Training_Setup.md) provide guidance for preprocessing alignments into an OpenFold format.
- For this guide, we will use the original OpenFold dataset which is available on RODA, processed with [these instructions](OpenFold_Training_Setup.md).
- GPUs configured with CUDA. Training OpenFold with CPUs only is not supported.
Expected data directory structure, assuming the default alignment file layout:
```
$DATA_DIR
├── pdb_data
│   ├── mmcifs
│   │   ├── 3lrm.cif
│   │   ├── 3u8d.cif
│   │   └── 6kwc.cif
│   ├── obsolete.dat
│   ├── duplicate_pdb_chains.txt
│   └── data_caches
│       ├── mmcif_cache.json
│       └── chain_data_cache.json
└── alignment_data
    └── alignments
        ├── 3lrm_A
        │   ├── mgnify_hits.a3m
        │   ├── bfd_uniclust_hits.a3m
        │   ├── uniref90_hits.a3m
        │   └── pdb70_hits.hhr
        ├── 3lrm_B
        └── 6kwc_A
```
The `mmcif_cache.json` and `chain_data_cache.json` files provide metadata for the mmCIF files and the protein chains in the dataset.
## Training a new OpenFold model
#### Basic command
The basic command to train a new OpenFold model is:
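The command itself was not reproduced here; the following is a sketch of what it might look like. The positional argument order (`mmcif_dir`, `alignment_dir`, `template_mmcif_dir`, `output_dir`) and the exact paths are assumptions inferred from the flag descriptions below; verify against `python3 train_openfold.py --help` before running.

```shell
# Illustrative sketch only: argument order, paths, and dates are
# placeholders; check `python3 train_openfold.py --help`.
python3 train_openfold.py \
    "$DATA_DIR/pdb_data/mmcifs" \
    "$DATA_DIR/alignment_data/alignments" \
    "$DATA_DIR/pdb_data/mmcifs" \
    "$OUTPUT_DIR" \
    --max_template_date 2021-10-10 \
    --template_release_dates_cache_path "$DATA_DIR/pdb_data/data_caches/mmcif_cache.json" \
    --config_preset initial_training \
    --seed 42 \
    --num_nodes 1 \
    --gpus 4 \
    --num_workers 8
```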
- `alignment_dir`: Alignments for the sequences in `mmcif_dir`; see the expected directory structure above
- `template_mmcif_dir`: Template mmCIF files with structures, which can be the same directory as `mmcif_dir`. The `max_template_date` and `template_release_dates_cache_path` flags specify which templates are allowed based on a date cutoff
- `output_dir`: Where model checkpoint files and other outputs will be saved
Commonly used flags include:
- `config_preset`: Specifies which selection of hyperparameters should be used for initial model training. Commonly used configs are defined in [`openfold/config.py`](https://github.com/aqlaboratory/openfold/blob/main/openfold/config.py)
- `num_nodes` and `gpus`: Specify the number of nodes and GPUs available for training OpenFold
- `seed`: Specifies the random seed
- `num_workers`: Number of CPU workers to assign for creating dataset examples
Note that `--seed` must be specified to correctly configure training examples on multi-GPU training runs.
#### Train OpenFold with Different Dataset Configurations
If the [OpenFold alignment database](OpenFold_Training_Setup.md#2-creating-alignment-dbs-optional) setup is used, resulting in a data directory such as:
```
$DATA_DIR
├── duplicate_pdb_chains.txt
├── pdb_data
│   └── mmcifs
│       ├── 3lrm.cif
│       └── 6kwc.cif
└── alignment_data
    └── alignment_db
        ├── alignment_db_0.db
        ├── alignment_db_1.db
        ...
        ├── alignment_db_9.db
        └── alignment_db.index
```
The training command will use the `alignment_index_path` argument to specify `db.index` files, e.g.:
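The example command was not reproduced here; a sketch of what the alignment-database variant might look like, assuming the same positional arguments as the basic command (paths and flags are assumptions; verify against `python3 train_openfold.py --help`):

```shell
# Same basic invocation, but pointing the alignment path at the packed
# databases and passing their index file (paths are illustrative).
python3 train_openfold.py \
    "$DATA_DIR/pdb_data/mmcifs" \
    "$DATA_DIR/alignment_data/alignment_db" \
    "$DATA_DIR/pdb_data/mmcifs" \
    "$OUTPUT_DIR" \
    --max_template_date 2021-10-10 \
    --config_preset initial_training \
    --alignment_index_path "$DATA_DIR/alignment_data/alignment_db/alignment_db.index" \
    --seed 42
```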
Here we provide brief descriptions for customizing your training run of OpenFold:
- **Restart training from an existing checkpoint:** Use the `--resume_from_ckpt` flag to restart training from an existing checkpoint.
## Advanced Training Configurations
### Training OpenFold Multimer
At this time, we do not have a multimer training set available. To prepare your own multimer training set, please see the instructions at [Data Processing - multimer]
The basic command for training a multimer model is then:
```
multimer training command here
```
The key differences are:
- Dataset configuration / preparation
### Fine-tuning from existing model weights
If you have existing model weights, you can fine-tune the model by specifying a checkpoint path with the `--resume_from_ckpt` and `--resume_model_weights_only` arguments, e.g.:
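A sketch of what such a fine-tuning invocation might look like, assuming the same positional arguments as the basic training command (the checkpoint path, config preset name, and flag usage are assumptions; verify against `python3 train_openfold.py --help`):

```shell
# Load only the model weights from an existing checkpoint and continue
# training (checkpoint path and preset name are placeholders).
python3 train_openfold.py \
    "$DATA_DIR/pdb_data/mmcifs" \
    "$DATA_DIR/alignment_data/alignments" \
    "$DATA_DIR/pdb_data/mmcifs" \
    "$OUTPUT_DIR" \
    --max_template_date 2021-10-10 \
    --config_preset finetuning \
    --resume_from_ckpt "$CKPT_PATH" \
    --resume_model_weights_only
```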
If you have model parameters from OpenFold v1.x, you may need to convert your checkpoint file or parameters. See [Converting OpenFold v1 Weights](convert_of_v1_weights.md) for more details.
### Using MPI
If MPI is configured on your system and you would like to use MPI to train OpenFold:
1. Add the `mpi4py` package, which is available through pip and conda. Please see the [mpi4py documentation](https://pypi.org/project/mpi4py/) for more instructions on installation.
2. Add the `--mpi_plugin` flag to your training command.
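The two steps above might be combined as follows. This is a sketch, not a tested launch line: the `mpirun` process count and the training arguments are placeholders, and your cluster's MPI launcher (`mpirun`, `mpiexec`, `srun`) may differ.

```shell
# Launch training under MPI with the plugin enabled (process count,
# paths, and flags are illustrative placeholders).
mpirun -np 8 python3 train_openfold.py \
    "$DATA_DIR/pdb_data/mmcifs" \
    "$DATA_DIR/alignment_data/alignments" \
    "$DATA_DIR/pdb_data/mmcifs" \
    "$OUTPUT_DIR" \
    --config_preset initial_training \
    --mpi_plugin
```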
- [Get started with OpenFold](Installation.md)
- Learn how to [run inference with OpenFold](Inference.md)
- [Train your own OpenFold models](Training_OpenFold.md)
- Find guidance for setup and running OpenFold in the [FAQ](FAQ.md).
We also have a [Colab notebook](https://colab.research.google.com/github/aqlaboratory/openfold/blob/main/notebooks/OpenFold.ipynb) that can be used for single structure / multimer prediction.
Some portions of the documentation are still under migration from the original README, which can be found [here](original_readme.md).