Commit 11d5fdf4 authored by jnwei

Update training OpenFold docs with correct paths.

parent 9a6deab7
![header](imgs/of_banner.png)
_Figure: Comparison of OpenFold and AlphaFold2 predictions to the experimental structure of PDB 7KDX, chain B._
# OpenFold
A faithful but trainable PyTorch reproduction of DeepMind's
@@ -10,6 +9,8 @@ A faithful but trainable PyTorch reproduction of DeepMind's
# Documentation
See our new home for docs at [openfold.readthedocs.io](https://openfold.readthedocs.io/en/latest/), with instructions for installation and model inference/training.
Much of the content from this page may be found [here](https://github.com/aqlaboratory/openfold/blob/main/docs/source/original_readme.md).
## Copyright Notice
While AlphaFold's and, by extension, OpenFold's source code is licensed under
...
@@ -14,7 +14,7 @@ For example, consider two proteins as a case study
```
- OpenProteinSet
    └── mmcifs
        ├── 3lrm.cif
        └── 6kwc.cif
        ...
```
@@ -64,13 +64,13 @@ All together, the file directory would look like:
└── pdb
    ├── mmcif_cache.json
    └── mmcifs
        ├── 3lrm.cif
        └── 6kwc.cif
└── alignment_db
    ├── alignment_db_0.db
    ├── alignment_db_1.db
    ...
    ├── alignment_db_9.db
    └── alignment_db.index
```
...
@@ -4,18 +4,20 @@ The multiple sequence alignments of OpenProteinSet and mmCIF structure files req
### Pre-Requisites:
- OpenFold conda environment. See [OpenFold Installation](Installation.md) for instructions on how to build this environment.
- In particular, the [AWS CLI](https://aws.amazon.com/cli/) is used to download data from RODA.
- For this guide, we assume that the OpenFold codebase is located at `$OF_DIR` (a quick environment check is sketched below).
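Before moving on, it can help to confirm that the prerequisites above are in place. A minimal check, assuming a conda environment named `openfold_env` and a checkout at `~/openfold` (both names are placeholders for your own setup):

```bash
# Placeholder names: adjust the environment name and checkout path to your own setup.
conda activate openfold_env
aws --version                # the AWS CLI must be on PATH for the RODA downloads below
export OF_DIR=~/openfold     # location of the OpenFold checkout, referenced throughout this guide
```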
## 1. Downloading alignments and structure files
To fetch all the alignments corresponding to the original PDB training set of OpenFold alongside their mmCIF 3D structures, you can run the following commands:
```bash
mkdir -p alignment_data/alignment_dir_roda
aws s3 cp s3://openfold/pdb/ alignment_data/alignment_dir_roda/ --recursive --no-sign-request
mkdir pdb_data
aws s3 cp s3://openfold/pdb_mmcif.zip pdb_data/ --no-sign-request
aws s3 cp s3://openfold/duplicate_pdb_chains.txt . --no-sign-request
unzip pdb_data/pdb_mmcif.zip -d pdb_data
```
The nested alignment directory structure is not yet exactly what OpenFold expects, so you can run the `flatten_roda.sh` script to convert it to the correct format:
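The exact arguments are documented in the script itself; a sketch of the expected invocation, assuming the script takes the RODA download directory followed by an output directory (verify against `$OF_DIR/scripts/flatten_roda.sh` before running):

```bash
# Assumed argument order -- check the script header for the authoritative usage.
bash $OF_DIR/scripts/flatten_roda.sh alignment_data/alignment_dir_roda alignment_data/alignments
```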
@@ -102,7 +104,12 @@ python $OF_DIR/scripts/fasta_to_clusterfile.py \
## 5. Generating cache files
As a last step, OpenFold requires ["cache" files](Aux_seq_files.md#chain-cache-files-and-mmcif-cache-files) with metadata for each chain, which are used for choosing templates and samples during training.
The data caches for OpenProteinSet can be downloaded from RODA with the following:
```bash
aws s3 cp s3://openfold/data_caches/ pdb_data/ --recursive --no-sign-request
```
If you wish to create data caches for your own datasets, the steps to generate the cache are as follows:
```bash
mkdir pdb_data/data_caches
...
# Training OpenFold
## Background
This guide covers how to train an OpenFold model for monomers. Some additional instructions are provided at the end for fine-tuning your model.
### Pre-requisites:
This guide requires the following:
- [Installation of OpenFold and dependencies](Installation.md) (including the jackhmmer and hhblits dependencies)
- A preprocessed dataset:
  - For this guide, we will use the original OpenFold dataset, which is available on RODA and processed with [these instructions](OpenFold_Training_Setup.md).
  - If you wish to construct your own dataset, [these instructions](OpenFold_Training_Setup.md) provide guidance for preprocessing alignments into an OpenFold-compatible format.
- GPUs configured with CUDA. Training OpenFold with CPUs only is not supported (a quick check is sketched below).
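Before launching a long run, a quick sanity check that PyTorch can see your GPUs:

```bash
# Should print "True" and a nonzero device count on a correctly configured machine.
python3 -c "import torch; print(torch.cuda.is_available(), torch.cuda.device_count())"
```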
## Training a new OpenFold model
#### Basic command
For a dataset that has the default alignment file structure, e.g.
```
- $DATA_DIR
    ├── pdb_data
        ├── mmcifs
            ├── 3lrm.cif
            └── 6kwc.cif
            ...
        ├── obsolete.dat
        ├── duplicate_pdb_chains.txt
        └── data_caches
            ├── mmcif_cache.json
            └── chain_data_cache.json
    └── alignment_data
        └── alignments
            ├── 3lrm_A/
            ├── 3lrm_B/
            └── 6kwc_A/
            ...
```
The basic command to train a new OpenFold model is:
```
python3 train_openfold.py $DATA_DIR/pdb_data/mmcifs $DATA_DIR/alignment_data/alignments $TEMPLATE_MMCIF_DIR $OUTPUT_DIR \
    --max_template_date 2021-10-10 \
    --train_chain_data_cache_path $DATA_DIR/pdb_data/data_caches/chain_data_cache.json \
    --template_release_dates_cache_path $DATA_DIR/pdb_data/data_caches/mmcif_cache.json \
    --config_preset initial_training \
    --seed 42 \
    --obsolete_pdbs_file_path $DATA_DIR/pdb_data/obsolete.dat \
    --num_nodes 1 \
    --gpus 4 \
    --num_workers 4
```
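`$DATA_DIR`, `$TEMPLATE_MMCIF_DIR`, and `$OUTPUT_DIR` are placeholders for your own paths; for example (hypothetical locations):

```bash
# Hypothetical paths -- point these at the directory tree described above.
export DATA_DIR=/data/openfold_training
export TEMPLATE_MMCIF_DIR=$DATA_DIR/pdb_data/mmcifs   # templates may reuse the training mmCIFs
export OUTPUT_DIR=/data/openfold_runs/initial_training
```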
The required arguments are:
- `mmcif_dir`: mmCIF files for the training set.
- `alignments_dir`: Alignments for the sequences in `mmcif_dir`; see the expected directory structure above.
- `template_mmcif_dir`: Template mmCIF files with structures, which can be the same directory as `mmcif_dir`. The `max_template_date` and `template_release_dates_cache_path` options specify which templates are allowed based on a date cutoff.
- `output_dir`: Where model checkpoint files and other outputs will be saved.

Commonly used flags include:
- `config_preset`: Specifies which selection of hyperparameters should be used for initial model training. Commonly used configs are defined in [`openfold/config.py`](https://github.com/aqlaboratory/openfold).
- `num_nodes` and `gpus`: Specify the number of nodes and GPUs available to train OpenFold.
- `seed`: Specifies the random seed.
- `num_workers`: Number of CPU workers to assign for creating dataset examples.
@@ -67,16 +70,40 @@ Commonly used flags include:
Note that `--seed` must be specified to correctly configure training examples on multi-GPU training runs.
```
#### Train with OpenFold Dataset Configuration
If the [OpenFold alignment database](OpenFold_Training_Setup.md#2-creating-alignment-dbs-optional) setup is used, resulting in a data directory such as:
```
- $DATA_DIR
    ├── duplicate_pdb_chains.txt
    ├── pdb_data
        └── mmcifs
            ├── 3lrm.cif
            └── 6kwc.cif
    └── alignment_data
        └── alignment_db
            ├── alignment_db_0.db
            ├── alignment_db_1.db
            ...
            ├── alignment_db_9.db
            └── alignment_db.index
```
The training command then uses the `alignment_index_path` argument to point to the `alignment_db.index` file, e.g.:
```
python3 train_openfold.py $DATA_DIR/pdb_data/mmcifs $DATA_DIR/alignment_data/alignment_db $TEMPLATE_MMCIF_DIR $OUTPUT_DIR \
    --max_template_date 2021-10-10 \
    --train_chain_data_cache_path $DATA_DIR/pdb_data/data_caches/chain_data_cache.json \
    --template_release_dates_cache_path $DATA_DIR/pdb_data/data_caches/mmcif_cache.json \
    --alignment_index_path $DATA_DIR/alignment_data/alignment_db/alignment_db.index \
    --config_preset initial_training \
    --seed 42 \
    --obsolete_pdbs_file_path $DATA_DIR/pdb_data/obsolete.dat \
    --num_nodes 1 \
    --gpus 4 \
    --num_workers 4
```
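Before launching, it can be worth confirming that the index covers the chains you expect. A small sanity check, assuming `alignment_db.index` is a JSON file keyed by chain ID (as produced by the alignment-db creation step):

```bash
# Prints the number of chains covered by the alignment database index (assumes JSON format).
python3 -c "import json; print(len(json.load(open('$DATA_DIR/alignment_data/alignment_db/alignment_db.index'))))"
```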
#### Additional command line flag options:
@@ -104,40 +131,29 @@ Here we provide brief descriptions for customizing your training run of OpenFold
- **Restart training from an existing checkpoint:** Use the `--resume_from_ckpt` flag to restart training from an existing checkpoint.
## Advanced Training Configurations
### Fine tuning from existing model weights
If you have existing model weights, you can fine tune the model by specifying a checkpoint path with the `--resume_from_ckpt` and `--resume_model_weights_only` arguments, e.g.:
```
python3 train_openfold.py $DATA_DIR/pdb_data/mmcifs $DATA_DIR/alignment_data/alignment_db $TEMPLATE_MMCIF_DIR $OUTPUT_DIR \
    --max_template_date 2021-10-10 \
    --train_chain_data_cache_path $DATA_DIR/pdb_data/data_caches/chain_data_cache.json \
    --template_release_dates_cache_path $DATA_DIR/pdb_data/data_caches/mmcif_cache.json \
    --config_preset finetuning \
    --alignment_index_path $DATA_DIR/alignment_data/alignment_db/alignment_db.index \
    --seed 4242022 \
    --obsolete_pdbs_file_path $DATA_DIR/pdb_data/obsolete.dat \
    --num_nodes 1 \
    --gpus 4 \
    --num_workers 4 \
    --resume_from_ckpt $CHECKPOINT_PATH \
    --resume_model_weights_only
```
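If you are unsure whether `$CHECKPOINT_PATH` points to a full training checkpoint or a bare state dict, a quick inspection can help (a hypothetical one-liner; it assumes the checkpoint is a single PyTorch file rather than a DeepSpeed checkpoint directory):

```bash
# Prints the checkpoint's top-level keys, e.g. "state_dict", "optimizer_states", ...
python3 -c "import torch; print(list(torch.load('$CHECKPOINT_PATH', map_location='cpu').keys()))"
```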
If you have model parameters from OpenFold v1.x, you may need to convert your checkpoint or parameter files. See [Converting OpenFold v1 Weights](convert_of_v1_weights.md) for more details.
### Using MPI
@@ -145,3 +161,10 @@ If MPI is configured on your system, and you would like to use MPI to train Open
1. Add the `mpi4py` package, which is available through pip and conda. See the [mpi4py documentation](https://pypi.org/project/mpi4py/) for installation instructions.
2. Add the `--mpi_plugin` flag to your training command (a sketch of a full launch follows).
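Putting the two steps together, a hypothetical launch might look like the following; the `mpirun` invocation and process counts depend entirely on your cluster, so treat this only as a sketch:

```bash
# Hypothetical 2-node x 4-GPU launch (one rank per GPU); adapt -np and scheduler
# integration to your site.
pip install mpi4py
mpirun -np 8 python3 train_openfold.py $DATA_DIR/pdb_data/mmcifs $DATA_DIR/alignment_data/alignments \
    $TEMPLATE_MMCIF_DIR $OUTPUT_DIR \
    --config_preset initial_training \
    --seed 42 \
    --num_nodes 2 \
    --gpus 4 \
    --mpi_plugin
```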
### Training Multimer models
```{note}
Coming soon.
```
\ No newline at end of file
@@ -25,8 +25,7 @@ $ python3 $OPENFOLD_DIR/train_openfold.py test_data_epoch/mmcifs test_data_epoch
### How do I convert my checkpoints?
Use [`scripts/convert_v1_to_v2_weights.py`](https://github.com/aqlaboratory/openfold/blob/main/scripts/convert_v1_to_v2_weights.py), e.g.
`python scripts/convert_v1_to_v2_weights.py checkpoints/6-209.ckpt checkpoints/6-209.ckpt.converted`
...
@@ -8,14 +8,15 @@
Welcome to the Documentation for OpenFold, the fully open source, trainable, PyTorch-based reproduction of DeepMind's
[AlphaFold 2](https://github.com/deepmind/alphafold).
Here, you will find guides and documentation for:
- [Getting started with OpenFold](installation.md)!
- Learn how to [run inference with OpenFold](Inference.md)
- [Train your own OpenFold models](Training_OpenFold.md)
- Find guidance for setup and running OpenFold in the [FAQ](FAQ.md).
We also have a [Colab notebook](https://colab.research.google.com/github/aqlaboratory/openfold/blob/main/notebooks/OpenFold.ipynb) that can be used for single structure / multimer prediction.
Some portions of the documentation are still under migration from the original README, which can be found [here](original_readme.md).
# Features
...