Commit 2ada4f8d authored by Sam DeLuca

Merge remote-tracking branch 'origin/main' into run-multiple-models

parents 07b522b7 c871ccf3
......@@ -26,7 +26,6 @@ OpenFold also supports inference using AlphaFold's official parameters.
OpenFold has the following advantages over the reference implementation:
- **Faster inference** on GPU for chains with < 1500 residues.
- **Inference on extremely long chains**, made possible by our implementation of low-memory attention
([Rabe & Staats 2021](https://arxiv.org/pdf/2112.05682.pdf)). OpenFold can predict the structures of
sequences with more than 4000 residues on a single A100, and even longer ones with CPU offloading (see the sketch after this list).
......@@ -35,17 +34,19 @@ kernels support in-place attention during inference and training. They use
4x and 5x less GPU memory than equivalent FastFold and stock PyTorch
implementations, respectively.
- **Efficient alignment scripts** using the original AlphaFold HHblits/JackHMMER pipeline or [ColabFold](https://github.com/sokrypton/ColabFold)'s, which uses the faster MMseqs2 instead. We've used them to generate millions of alignments.
- **Faster inference** on GPU for short chains.
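
For intuition, here is a minimal sketch of the query-chunking idea at the core of low-memory attention. It is illustrative only: the function name is ours, and OpenFold's actual kernels go further, also chunking keys/values with a numerically stable streaming softmax as in Rabe & Staats 2021.

```python
import torch

def chunked_attention(q, k, v, chunk_size=1024):
    # Computes softmax(q @ k^T / sqrt(d)) @ v one query chunk at a time,
    # so the full [n, n] score matrix is never materialized; peak score
    # memory is [chunk_size, n] instead.
    scale = q.shape[-1] ** -0.5
    out = []
    for q_chunk in q.split(chunk_size, dim=0):
        scores = (q_chunk @ k.transpose(-1, -2)) * scale  # [chunk_size, n]
        out.append(torch.softmax(scores, dim=-1) @ v)
    return torch.cat(out, dim=0)
```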
## Installation (Linux)
All Python dependencies are specified in `environment.yml`. For producing sequence
alignments, you'll also need `kalign`, the [HH-suite](https://github.com/soedinglab/hh-suite),
and one of {`jackhmmer`, [MMseqs2](https://github.com/soedinglab/mmseqs2) (nightly build)}
installed on your system. Finally, some download scripts require `aria2c`.
installed on your system. You'll need `git-lfs` to download OpenFold parameters.
Finally, some download scripts require `aria2c`.
For convenience, we provide a script that installs Miniconda locally, creates a
`conda` virtual environment, installs all Python dependencies, and downloads
useful resources (including DeepMind's pretrained parameters). Run:
useful resources, including both sets of model parameters. Run:
```bash
scripts/install_third_party_dependencies.sh
```
......@@ -302,6 +303,10 @@ multi-node distributed training, validation, and so on. For more information,
consult PyTorch Lightning documentation and the `--help` flag of the training
script.
If you're using your own MSAs or MSAs from the RODA repository, make sure that
the `alignment_dir` contains one directory per chain and that each of these
contains alignments (.sto, .a3m, and .hhr) corresponding to that chain.
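As a quick sanity check of that layout, a hypothetical helper (not part of OpenFold) might look like:

```python
import os

def check_alignment_dir(alignment_dir):
    # Warn about chain directories that contain no recognized alignment files.
    for chain in sorted(os.listdir(alignment_dir)):
        chain_dir = os.path.join(alignment_dir, chain)
        if not os.path.isdir(chain_dir):
            continue
        exts = {os.path.splitext(f)[1] for f in os.listdir(chain_dir)}
        if not exts & {".sto", ".a3m", ".hhr"}:
            print(f"Warning: no .sto/.a3m/.hhr files for chain {chain}")
```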
Note that, despite its variable name, `mmcif_dir` can also contain PDB files
or even ProteinNet .core files. To emulate the AlphaFold training procedure,
which uses a self-distillation set subject to special preprocessing steps, use
......
......@@ -4,7 +4,7 @@
"metadata": {
"accelerator": "GPU",
"colab": {
"name": "Copy of OpenFold.ipynb",
"name": "OpenFold.ipynb",
"provenance": [],
"collapsed_sections": []
},
......@@ -31,7 +31,7 @@
"\n",
"OpenFold is a trainable PyTorch reimplementation of AlphaFold 2. For the purposes of inference, it is practically identical to the original (\"practically\" because ensembling is excluded from OpenFold (recycling is enabled, however)).\n",
"\n",
"In this notebook, OpenFold is run with DeepMind's publicly released parameters for AlphaFold 2.\n",
"In this notebook, OpenFold is run with your choice of our original OpenFold parameters or DeepMind's publicly released parameters for AlphaFold 2.\n",
"\n",
"**Note**\n",
"\n",
......@@ -43,7 +43,7 @@
"\n",
"**Licenses**\n",
"\n",
"This Colab uses the [AlphaFold model parameters](https://github.com/deepmind/alphafold/#model-parameters-license), made available under the Creative Commons Attribution 4.0 International ([CC BY 4.0](https://creativecommons.org/licenses/by/4.0/legalcode)) license. The Colab itself is provided under the [Apache 2.0 license](https://www.apache.org/licenses/LICENSE-2.0). See the full license statement below.\n",
"This Colab supports inference with the [AlphaFold model parameters](https://github.com/deepmind/alphafold/#model-parameters-license), made available under the Creative Commons Attribution 4.0 International ([CC BY 4.0](https://creativecommons.org/licenses/by/4.0/legalcode)) license. The Colab itself is provided under the [Apache 2.0 license](https://www.apache.org/licenses/LICENSE-2.0). See the full license statement below.\n",
"\n",
"**More information**\n",
"\n",
......@@ -111,6 +111,11 @@
" %shell wget -q -P /content \\\n",
" https://git.scicore.unibas.ch/schwede/openstructure/-/raw/7102c63615b64735c4941278d92b554ec94415f8/modules/mol/alg/src/stereo_chemical_props.txt\n",
" pbar.update(1)\n",
"\n",
" # Install git-lfs\n",
" %shell sudo apt-get install git-lfs\n",
" %shell git lfs install\n",
"\n",
"except subprocess.CalledProcessError as captured:\n",
" print(captured)\n",
" raise"
......@@ -132,13 +137,10 @@
"\n",
"GIT_REPO = 'https://github.com/aqlaboratory/openfold'\n",
"\n",
"OPENFOLD_PARAM_FILE_ID = \"1OpeMrfWEUSD_KqffbPqd5p7WsJjlC3ZE\"\n",
"OPENFOLD_PARAM_SOURCE_URL = \"https://huggingface.co/nz/OpenFold\"\n",
"ALPHAFOLD_PARAM_SOURCE_URL = 'https://storage.googleapis.com/alphafold/alphafold_params_2022-01-19.tar'\n",
"OPENFOLD_PARAMS_DIR = './openfold/openfold/resources/'\n",
"OPENFOLD_PARAMS_DIR = './openfold/openfold/resources/openfold_params'\n",
"ALPHAFOLD_PARAMS_DIR = './openfold/openfold/resources/params'\n",
"OPENFOLD_PARAMS_PATH = os.path.join(\n",
" OPENFOLD_PARAMS_DIR, \"openfold_params.tar.gz\"\n",
")\n",
"ALPHAFOLD_PARAMS_PATH = os.path.join(\n",
" ALPHAFOLD_PARAMS_DIR, os.path.basename(ALPHAFOLD_PARAM_SOURCE_URL)\n",
")\n",
......@@ -173,10 +175,7 @@
" %shell rm \"{ALPHAFOLD_PARAMS_PATH}\"\n",
"\n",
" %shell mkdir --parents \"{OPENFOLD_PARAMS_DIR}\"\n",
" %shell gdown --id \"{OPENFOLD_PARAM_FILE_ID}\" -O \"{OPENFOLD_PARAMS_PATH}\"\n",
" %shell tar --extract --verbose --file=\"{OPENFOLD_PARAMS_PATH}\" \\\n",
" --directory=\"{OPENFOLD_PARAMS_DIR}\" --preserve-permissions\n",
" %shell rm \"{OPENFOLD_PARAMS_PATH}\"\n",
" %shell git clone \"{OPENFOLD_PARAM_SOURCE_URL}\" \"{OPENFOLD_PARAMS_DIR}\"\n",
" pbar.update(55)\n",
"except subprocess.CalledProcessError:\n",
" print(captured)\n",
......@@ -472,7 +471,6 @@
" of_model_name = f\"finetuning_{model_name_spl[-1]}.pt\"\n",
" params_name = os.path.join(\n",
" OPENFOLD_PARAMS_DIR,\n",
" \"openfold_params\",\n",
" of_model_name\n",
" )\n",
" d = torch.load(params_name)\n",
......
......@@ -264,7 +264,7 @@ config = mlc.ConfigDict(
"fixed_size": True,
"subsample_templates": False, # We want top templates.
"masked_msa_replace_fraction": 0.15,
"max_msa_clusters": 128,
"max_msa_clusters": 512,
"max_extra_msa": 1024,
"max_template_hits": 4,
"max_templates": 4,
......
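The `max_msa_clusters` bump above controls how many clustered MSA sequences reach the model. These fields live in an `ml_collections.ConfigDict`, so they can be overridden in code before feature processing. A generic illustration (not OpenFold's config API; field names copied from the hunk above):

```python
import ml_collections as mlc

# Mirrors two fields from the hunk above; real OpenFold configs hold many more.
config = mlc.ConfigDict({
    "max_msa_clusters": 512,
    "max_extra_msa": 1024,
})

# A smaller clustered MSA reduces memory use at some cost in accuracy.
config.max_msa_clusters = 128
```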
......@@ -162,10 +162,7 @@ def prep_output(out, batch, feature_dict, feature_processor, args):
return unrelaxed_protein
def generate_batch(fasta_file, fasta_dir, alignment_dir, data_processor, feature_processor, prediction_dir):
with open(os.path.join(fasta_dir, fasta_file), "r") as fp:
data = fp.read()
def parse_fasta(data):
lines = [
l.replace('\n', '')
for prot in data.split('>') for l in prot.strip().split('\n', 1)
......@@ -173,25 +170,20 @@ def generate_batch(fasta_file, fasta_dir, alignment_dir, data_processor, feature
tags, seqs = lines[::2], lines[1::2]
tags = [t.split()[0] for t in tags]
# assert len(tags) == len(set(tags)), "All FASTA tags must be unique"
tag = '-'.join(tags)
output_name = f'{tag}_{args.config_preset}'
if args.output_postfix is not None:
output_name = f'{output_name}_{args.output_postfix}'
return tags, seqs
# Save the unrelaxed PDB.
unrelaxed_output_path = os.path.join(
prediction_dir, f'{output_name}_unrelaxed.pdb'
)
if os.path.exists(unrelaxed_output_path):
return
precompute_alignments(tags, seqs, alignment_dir, args)
def generate_feature_dict(
tags,
seqs,
alignment_dir,
data_processor,
args,
):
tmp_fasta_path = os.path.join(args.output_dir, f"tmp_{os.getpid()}.fasta")
if len(seqs) == 1:
tag = tags[0]
seq = seqs[0]
with open(tmp_fasta_path, "w") as fp:
fp.write(f">{tag}\n{seq}")
......@@ -212,10 +204,7 @@ def generate_batch(fasta_file, fasta_dir, alignment_dir, data_processor, feature
# Remove temporary FASTA file
os.remove(tmp_fasta_path)
processed_feature_dict = feature_processor.process_features(
feature_dict, mode='predict',
)
return processed_feature_dict, tag, feature_dict
return feature_dict
def load_models_from_command_line(args, config):
......@@ -232,7 +221,12 @@ def load_models_from_command_line(args, config):
logger.info(
f"Successfully loaded JAX parameters at {args.jax_param_path}..."
)
yield model, None
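            # Derive a model identifier from the checkpoint filename stem
            # (e.g. "params_model_1" from "params_model_1.npz") so that outputs
            # from different parameter sets don't collide.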
model_version = os.path.basename(
os.path.normpath(args.jax_param_path),
)
model_version = os.path.splitext(model_version)[0]
yield model, model_version
if args.openfold_checkpoint_path:
for path in args.openfold_checkpoint_path.split(","):
model = AlphaFold(config)
......@@ -264,11 +258,14 @@ def load_models_from_command_line(args, config):
# The public weights have had this done to them already
d = d["ema"]["params"]
model.load_state_dict(d)
model = model.to(args.model_device)
logger.info(
f"Loaded OpenFold parameters at {args.openfold_checkpoint_path}..."
)
yield model, checkpoint_basename
if not args.jax_param_path and not args.openfold_checkpoint_path:
raise ValueError(
"At least one of jax_param_path or openfold_checkpoint_path must "
......@@ -311,24 +308,41 @@ def main(args):
os.makedirs(prediction_dir, exist_ok=True)
for fasta_file in os.listdir(args.fasta_dir):
with open(os.path.join(args.fasta_dir, fasta_file), "r") as fp:
data = fp.read()
batch_data = generate_batch(
fasta_file,
args.fasta_dir,
alignment_dir,
data_processor,
feature_processor,
prediction_dir)
tags, seqs = parse_fasta(data)
# assert len(tags) == len(set(tags)), "All FASTA tags must be unique"
tag = '-'.join(tags)
if batch_data is None:
# this file has already been processed
output_name = f'{tag}_{args.config_preset}'
if args.output_postfix is not None:
output_name = f'{output_name}_{args.output_postfix}'
unrelaxed_output_path = os.path.join(
prediction_dir, f'{output_name}_unrelaxed.pdb'
)
# Output already exists
if os.path.exists(unrelaxed_output_path):
continue
batch, tag, feature_dict = batch_data
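        # Alignments are computed once per FASTA file and reused by every model below.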
precompute_alignments(tags, seqs, alignment_dir, args)
for model, model_version in load_models_from_command_line(args, config):
feature_dict = generate_feature_dict(
tags,
seqs,
alignment_dir,
data_processor,
args,
)
working_batch = deepcopy(batch)
processed_feature_dict = feature_processor.process_features(
feature_dict, mode='predict',
)
for model, model_version in load_models_from_command_line(args, config):
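            # Give each model its own copy of the processed features, since the
            # batch is modified downstream.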
working_batch = deepcopy(processed_feature_dict)
out = run_model(model, working_batch, tag, args)
# Toss out the recycling dimensions --- we don't need them anymore
......@@ -339,21 +353,11 @@ def main(args):
out, working_batch, feature_dict, feature_processor, args
)
output_name = f'{tag}_{args.config_preset}'
if model_version is not None:
output_name = f'{output_name}_{model_version}'
if args.output_postfix is not None:
output_name = f'{output_name}_{args.output_postfix}'
# Save the unrelaxed PDB.
unrelaxed_output_path = os.path.join(
prediction_dir, f'{output_name}_unrelaxed.pdb'
)
with open(unrelaxed_output_path, 'w') as fp:
fp.write(protein.to_pdb(unrelaxed_protein))
logger.info(f"Output written to {unrelaxed_output_path}...")
if not args.skip_relaxation:
amber_relaxer = relax.AmberRelaxation(
use_gpu=(args.model_device != "cpu"),
......@@ -377,6 +381,7 @@ def main(args):
)
with open(relaxed_output_path, 'w') as fp:
fp.write(relaxed_pdb_str)
logger.info(f"Relaxed output written to {relaxed_output_path}...")
if args.save_outputs:
......@@ -388,6 +393,7 @@ def main(args):
logger.info(f"Model output written to {output_dict_path}...")
if __name__ == "__main__":
parser = argparse.ArgumentParser()
parser.add_argument(
......@@ -413,8 +419,7 @@ if __name__ == "__main__":
)
parser.add_argument(
"--config_preset", type=str, default="model_1",
help="""Name of a model config. Choose one of model_{1-5} or
model_{1-5}_ptm, as defined on the AlphaFold GitHub."""
help="""Name of a model config preset defined in openfold/config.py"""
)
parser.add_argument(
"--jax_param_path", type=str, default=None,
......
......@@ -14,9 +14,10 @@
# See the License for the specific language governing permissions and
# limitations under the License.
#
# Downloads and unzips OpenFold parameters.
# Downloads and unzips OpenFold parameters from Google Drive. Alternative to
# the HuggingFace version.
#
# Usage: bash download_openfold_params.sh /path/to/download/directory
# Usage: bash download_openfold_params_gdrive.sh /path/to/download/directory
set -e
if [[ $# -eq 0 ]]; then
......
#!/bin/bash
#
# Copyright 2021 DeepMind Technologies Limited
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
#
# Downloads OpenFold parameters by cloning the Hugging Face git-lfs repository.
#
# Usage: bash download_openfold_params_huggingface.sh /path/to/download/directory
set -e
if [[ $# -eq 0 ]]; then
echo "Error: download directory must be provided as an input argument."
exit 1
fi
URL="https://huggingface.co/nz/OpenFold"
DOWNLOAD_DIR="${1}/openfold_params/"
mkdir -p "${DOWNLOAD_DIR}"
git clone "${URL}" "${DOWNLOAD_DIR}"
rm -rf "${DOWNLOAD_DIR}/.git"
......@@ -32,7 +32,7 @@ mkdir -p tests/test_data/alphafold/common
ln -rs openfold/resources/stereo_chemical_props.txt tests/test_data/alphafold/common
echo "Downloading OpenFold parameters..."
bash scripts/download_openfold_params.sh openfold/resources
bash scripts/download_openfold_params_huggingface.sh openfold/resources
echo "Downloading AlphaFold parameters..."
bash scripts/download_alphafold_params.sh openfold/resources
......