Commit 2ada4f8d authored by Sam DeLuca

Merge remote-tracking branch 'origin/main' into run-multiple-models

parents 07b522b7 c871ccf3
......@@ -26,7 +26,6 @@ OpenFold also supports inference using AlphaFold's official parameters.
OpenFold has the following advantages over the reference implementation:
- **Faster inference** on GPU for chains with < 1500 residues.
- **Inference on extremely long chains**, made possible by our implementation of low-memory attention
([Rabe & Staats 2021](https://arxiv.org/pdf/2112.05682.pdf)). OpenFold can predict the structures of
sequences with more than 4000 residues on a single A100, and even longer ones with CPU offloading (see the sketch after this list).
......@@ -35,17 +34,19 @@ kernels support in-place attention during inference and training. They use
4x and 5x less GPU memory than equivalent FastFold and stock PyTorch
implementations, respectively.
- **Efficient alignment scripts** using the original AlphaFold HHblits/JackHMMER pipeline or [ColabFold](https://github.com/sokrypton/ColabFold)'s, which uses the faster MMseqs2 instead. We've used them to generate millions of alignments.
- **Faster inference** on GPU for short chains.
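
For intuition, here is a minimal sketch of the query-chunking idea at the core of low-memory attention. It is illustrative only: the function name is ours, and OpenFold's actual kernels go further, also chunking keys/values with a numerically stable streaming softmax as in Rabe & Staats 2021.

```python
import torch

def chunked_attention(q, k, v, chunk_size=1024):
    # Computes softmax(q @ k^T / sqrt(d)) @ v one query chunk at a time,
    # so the full [n, n] score matrix is never materialized; peak score
    # memory is [chunk_size, n] instead.
    scale = q.shape[-1] ** -0.5
    out = []
    for q_chunk in q.split(chunk_size, dim=0):
        scores = (q_chunk @ k.transpose(-1, -2)) * scale  # [chunk_size, n]
        out.append(torch.softmax(scores, dim=-1) @ v)
    return torch.cat(out, dim=0)
```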
## Installation (Linux)
All Python dependencies are specified in `environment.yml`. For producing sequence
alignments, you'll also need `kalign`, the [HH-suite](https://github.com/soedinglab/hh-suite),
and one of {`jackhmmer`, [MMseqs2](https://github.com/soedinglab/mmseqs2) (nightly build)}
installed on your system. Finally, some download scripts require `aria2c`.
installed on your system. You'll need `git-lfs` to download OpenFold parameters.
Finally, some download scripts require `aria2c`.
For convenience, we provide a script that installs Miniconda locally, creates a
`conda` virtual environment, installs all Python dependencies, and downloads
useful resources (including DeepMind's pretrained parameters). Run:
useful resources, including both sets of model parameters. Run:
```bash
scripts/install_third_party_dependencies.sh
```
......@@ -302,6 +303,10 @@ multi-node distributed training, validation, and so on. For more information,
consult PyTorch Lightning documentation and the `--help` flag of the training
script.
If you're using your own MSAs or MSAs from the RODA repository, make sure that
the `alignment_dir` contains one directory per chain and that each of these
contains alignments (.sto, .a3m, and .hhr) corresponding to that chain.
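As a quick sanity check of that layout, a hypothetical helper (not part of OpenFold) might look like:

```python
import os

def check_alignment_dir(alignment_dir):
    # Warn about chain directories that contain no recognized alignment files.
    for chain in sorted(os.listdir(alignment_dir)):
        chain_dir = os.path.join(alignment_dir, chain)
        if not os.path.isdir(chain_dir):
            continue
        exts = {os.path.splitext(f)[1] for f in os.listdir(chain_dir)}
        if not exts & {".sto", ".a3m", ".hhr"}:
            print(f"Warning: no .sto/.a3m/.hhr files for chain {chain}")
```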
Note that, despite its variable name, `mmcif_dir` can also contain PDB files
or even ProteinNet .core files. To emulate the AlphaFold training procedure,
which uses a self-distillation set subject to special preprocessing steps, use
......
......@@ -4,7 +4,7 @@
"metadata": {
"accelerator": "GPU",
"colab": {
"name": "Copy of OpenFold.ipynb",
"name": "OpenFold.ipynb",
"provenance": [],
"collapsed_sections": []
},
......@@ -31,7 +31,7 @@
"\n",
"OpenFold is a trainable PyTorch reimplementation of AlphaFold 2. For the purposes of inference, it is practically identical to the original (\"practically\" because ensembling is excluded from OpenFold (recycling is enabled, however)).\n",
"\n",
"In this notebook, OpenFold is run with DeepMind's publicly released parameters for AlphaFold 2.\n",
"In this notebook, OpenFold is run with your choice of our original OpenFold parameters or DeepMind's publicly released parameters for AlphaFold 2.\n",
"\n",
"**Note**\n",
"\n",
......@@ -43,7 +43,7 @@
"\n",
"**Licenses**\n",
"\n",
"This Colab uses the [AlphaFold model parameters](https://github.com/deepmind/alphafold/#model-parameters-license), made available under the Creative Commons Attribution 4.0 International ([CC BY 4.0](https://creativecommons.org/licenses/by/4.0/legalcode)) license. The Colab itself is provided under the [Apache 2.0 license](https://www.apache.org/licenses/LICENSE-2.0). See the full license statement below.\n",
"This Colab supports inference with the [AlphaFold model parameters](https://github.com/deepmind/alphafold/#model-parameters-license), made available under the Creative Commons Attribution 4.0 International ([CC BY 4.0](https://creativecommons.org/licenses/by/4.0/legalcode)) license. The Colab itself is provided under the [Apache 2.0 license](https://www.apache.org/licenses/LICENSE-2.0). See the full license statement below.\n",
"\n",
"**More information**\n",
"\n",
......@@ -111,6 +111,11 @@
" %shell wget -q -P /content \\\n",
" https://git.scicore.unibas.ch/schwede/openstructure/-/raw/7102c63615b64735c4941278d92b554ec94415f8/modules/mol/alg/src/stereo_chemical_props.txt\n",
" pbar.update(1)\n",
"\n",
" # Install git-lfs\n",
" %shell sudo apt-get install git-lfs\n",
" %shell git lfs install\n",
"\n",
"except subprocess.CalledProcessError as captured:\n",
" print(captured)\n",
" raise"
......@@ -132,13 +137,10 @@
"\n",
"GIT_REPO = 'https://github.com/aqlaboratory/openfold'\n",
"\n",
"OPENFOLD_PARAM_FILE_ID = \"1OpeMrfWEUSD_KqffbPqd5p7WsJjlC3ZE\"\n",
"OPENFOLD_PARAM_SOURCE_URL = \"https://huggingface.co/nz/OpenFold\"\n",
"ALPHAFOLD_PARAM_SOURCE_URL = 'https://storage.googleapis.com/alphafold/alphafold_params_2022-01-19.tar'\n",
"OPENFOLD_PARAMS_DIR = './openfold/openfold/resources/'\n",
"OPENFOLD_PARAMS_DIR = './openfold/openfold/resources/openfold_params'\n",
"ALPHAFOLD_PARAMS_DIR = './openfold/openfold/resources/params'\n",
"OPENFOLD_PARAMS_PATH = os.path.join(\n",
" OPENFOLD_PARAMS_DIR, \"openfold_params.tar.gz\"\n",
")\n",
"ALPHAFOLD_PARAMS_PATH = os.path.join(\n",
" ALPHAFOLD_PARAMS_DIR, os.path.basename(ALPHAFOLD_PARAM_SOURCE_URL)\n",
")\n",
......@@ -173,10 +175,7 @@
" %shell rm \"{ALPHAFOLD_PARAMS_PATH}\"\n",
"\n",
" %shell mkdir --parents \"{OPENFOLD_PARAMS_DIR}\"\n",
" %shell gdown --id \"{OPENFOLD_PARAM_FILE_ID}\" -O \"{OPENFOLD_PARAMS_PATH}\"\n",
" %shell tar --extract --verbose --file=\"{OPENFOLD_PARAMS_PATH}\" \\\n",
" --directory=\"{OPENFOLD_PARAMS_DIR}\" --preserve-permissions\n",
" %shell rm \"{OPENFOLD_PARAMS_PATH}\"\n",
" %shell git clone \"{OPENFOLD_PARAM_SOURCE_URL}\" \"{OPENFOLD_PARAMS_DIR}\"\n",
" pbar.update(55)\n",
"except subprocess.CalledProcessError:\n",
" print(captured)\n",
......@@ -472,7 +471,6 @@
" of_model_name = f\"finetuning_{model_name_spl[-1]}.pt\"\n",
" params_name = os.path.join(\n",
" OPENFOLD_PARAMS_DIR,\n",
" \"openfold_params\",\n",
" of_model_name\n",
" )\n",
" d = torch.load(params_name)\n",
......
......@@ -264,7 +264,7 @@ config = mlc.ConfigDict(
"fixed_size": True,
"subsample_templates": False, # We want top templates.
"masked_msa_replace_fraction": 0.15,
"max_msa_clusters": 128,
"max_msa_clusters": 512,
"max_extra_msa": 1024,
"max_template_hits": 4,
"max_templates": 4,
......
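The `max_msa_clusters` bump above controls how many clustered MSA sequences reach the model. These fields live in an `ml_collections.ConfigDict`, so they can be overridden in code before feature processing. A generic illustration (not OpenFold's config API; field names copied from the hunk above):

```python
import ml_collections as mlc

# Mirrors two fields from the hunk above; real OpenFold configs hold many more.
config = mlc.ConfigDict({
    "max_msa_clusters": 512,
    "max_extra_msa": 1024,
})

# A smaller clustered MSA reduces memory use at some cost in accuracy.
config.max_msa_clusters = 128
```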
......@@ -162,10 +162,7 @@ def prep_output(out, batch, feature_dict, feature_processor, args):
return unrelaxed_protein
def generate_batch(fasta_file, fasta_dir, alignment_dir, data_processor, feature_processor, prediction_dir):
with open(os.path.join(fasta_dir, fasta_file), "r") as fp:
data = fp.read()
def parse_fasta(data):
lines = [
l.replace('\n', '')
for prot in data.split('>') for l in prot.strip().split('\n', 1)
......@@ -173,25 +170,20 @@ def generate_batch(fasta_file, fasta_dir, alignment_dir, data_processor, feature
tags, seqs = lines[::2], lines[1::2]
tags = [t.split()[0] for t in tags]
# assert len(tags) == len(set(tags)), "All FASTA tags must be unique"
tag = '-'.join(tags)
output_name = f'{tag}_{args.config_preset}'
if args.output_postfix is not None:
output_name = f'{output_name}_{args.output_postfix}'
return tags, seqs
# Save the unrelaxed PDB.
unrelaxed_output_path = os.path.join(
prediction_dir, f'{output_name}_unrelaxed.pdb'
)
if os.path.exists(unrelaxed_output_path):
return
precompute_alignments(tags, seqs, alignment_dir, args)
def generate_feature_dict(
tags,
seqs,
alignment_dir,
data_processor,
args,
):
tmp_fasta_path = os.path.join(args.output_dir, f"tmp_{os.getpid()}.fasta")
if len(seqs) == 1:
tag = tags[0]
seq = seqs[0]
with open(tmp_fasta_path, "w") as fp:
fp.write(f">{tag}\n{seq}")
......@@ -212,10 +204,7 @@ def generate_batch(fasta_file, fasta_dir, alignment_dir, data_processor, feature
# Remove temporary FASTA file
os.remove(tmp_fasta_path)
processed_feature_dict = feature_processor.process_features(
feature_dict, mode='predict',
)
return processed_feature_dict, tag, feature_dict
return feature_dict
def load_models_from_command_line(args, config):
......@@ -232,7 +221,12 @@ def load_models_from_command_line(args, config):
logger.info(
f"Successfully loaded JAX parameters at {args.jax_param_path}..."
)
yield model, None
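            # Derive a model identifier from the checkpoint filename stem
            # (e.g. "params_model_1" from "params_model_1.npz") so that outputs
            # from different parameter sets don't collide.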
model_version = os.path.basename(
os.path.normpath(args.jax_param_path),
)
model_version = os.path.splitext(model_version)[0]
yield model, model_version
if args.openfold_checkpoint_path:
for path in args.openfold_checkpoint_path.split(","):
model = AlphaFold(config)
......@@ -264,11 +258,14 @@ def load_models_from_command_line(args, config):
# The public weights have had this done to them already
d = d["ema"]["params"]
model.load_state_dict(d)
model = model.to(args.model_device)
logger.info(
f"Loaded OpenFold parameters at {args.openfold_checkpoint_path}..."
)
yield model, checkpoint_basename
if not args.jax_param_path and not args.openfold_checkpoint_path:
raise ValueError(
"At least one of jax_param_path or openfold_checkpoint_path must "
......@@ -311,24 +308,41 @@ def main(args):
os.makedirs(prediction_dir, exist_ok=True)
for fasta_file in os.listdir(args.fasta_dir):
with open(os.path.join(args.fasta_dir, fasta_file), "r") as fp:
data = fp.read()
batch_data = generate_batch(
fasta_file,
args.fasta_dir,
alignment_dir,
data_processor,
feature_processor,
prediction_dir)
tags, seqs = parse_fasta(data)
# assert len(tags) == len(set(tags)), "All FASTA tags must be unique"
tag = '-'.join(tags)
if batch_data is None:
# this file has already been processed
output_name = f'{tag}_{args.config_preset}'
if args.output_postfix is not None:
output_name = f'{output_name}_{args.output_postfix}'
unrelaxed_output_path = os.path.join(
prediction_dir, f'{output_name}_unrelaxed.pdb'
)
# Output already exists
if os.path.exists(unrelaxed_output_path):
continue
batch, tag, feature_dict = batch_data
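        # Alignments are computed once per FASTA file and reused by every model below.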
precompute_alignments(tags, seqs, alignment_dir, args)
for model, model_version in load_models_from_command_line(args, config):
feature_dict = generate_feature_dict(
tags,
seqs,
alignment_dir,
data_processor,
args,
)
working_batch = deepcopy(batch)
processed_feature_dict = feature_processor.process_features(
feature_dict, mode='predict',
)
for model, model_version in load_models_from_command_line(args, config):
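            # Give each model its own copy of the processed features, since the
            # batch is modified downstream.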
working_batch = deepcopy(processed_feature_dict)
out = run_model(model, working_batch, tag, args)
# Toss out the recycling dimensions --- we don't need them anymore
......@@ -339,21 +353,11 @@ def main(args):
out, working_batch, feature_dict, feature_processor, args
)
output_name = f'{tag}_{args.config_preset}'
if model_version is not None:
output_name = f'{output_name}_{model_version}'
if args.output_postfix is not None:
output_name = f'{output_name}_{args.output_postfix}'
# Save the unrelaxed PDB.
unrelaxed_output_path = os.path.join(
prediction_dir, f'{output_name}_unrelaxed.pdb'
)
with open(unrelaxed_output_path, 'w') as fp:
fp.write(protein.to_pdb(unrelaxed_protein))
logger.info(f"Output written to {unrelaxed_output_path}...")
if not args.skip_relaxation:
amber_relaxer = relax.AmberRelaxation(
use_gpu=(args.model_device != "cpu"),
......@@ -377,6 +381,7 @@ def main(args):
)
with open(relaxed_output_path, 'w') as fp:
fp.write(relaxed_pdb_str)
logger.info(f"Relaxed output written to {relaxed_output_path}...")
if args.save_outputs:
......@@ -388,6 +393,7 @@ def main(args):
logger.info(f"Model output written to {output_dict_path}...")
if __name__ == "__main__":
parser = argparse.ArgumentParser()
parser.add_argument(
......@@ -413,8 +419,7 @@ if __name__ == "__main__":
)
parser.add_argument(
"--config_preset", type=str, default="model_1",
help="""Name of a model config. Choose one of model_{1-5} or
model_{1-5}_ptm, as defined on the AlphaFold GitHub."""
help="""Name of a model config preset defined in openfold/config.py"""
)
parser.add_argument(
"--jax_param_path", type=str, default=None,
......
......@@ -14,9 +14,10 @@
# See the License for the specific language governing permissions and
# limitations under the License.
#
# Downloads and unzips OpenFold parameters.
# Downloads and unzips OpenFold parameters from Google Drive. Alternative to
# the HuggingFace version.
#
# Usage: bash download_openfold_params.sh /path/to/download/directory
# Usage: bash download_openfold_params_gdrive.sh /path/to/download/directory
set -e
if [[ $# -eq 0 ]]; then
......
#!/bin/bash
#
# Copyright 2021 DeepMind Technologies Limited
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
#
# Downloads OpenFold parameters by cloning the Hugging Face git-lfs repository.
#
# Usage: bash download_openfold_params_huggingface.sh /path/to/download/directory
set -e
if [[ $# -eq 0 ]]; then
echo "Error: download directory must be provided as an input argument."
exit 1
fi
URL="https://huggingface.co/nz/OpenFold"
DOWNLOAD_DIR="${1}/openfold_params/"
mkdir -p "${DOWNLOAD_DIR}"
git clone "${URL}" "${DOWNLOAD_DIR}"
rm -rf "${DOWNLOAD_DIR}/.git"
......@@ -32,7 +32,7 @@ mkdir -p tests/test_data/alphafold/common
ln -rs openfold/resources/stereo_chemical_props.txt tests/test_data/alphafold/common
echo "Downloading OpenFold parameters..."
bash scripts/download_openfold_params.sh openfold/resources
bash scripts/download_openfold_params_huggingface.sh openfold/resources
echo "Downloading AlphaFold parameters..."
bash scripts/download_alphafold_params.sh openfold/resources
......