Commit 36f317be authored by Gustaf

Add MMseqs2 MSA generation

parent 99361481
...@@ -60,12 +60,32 @@ To install the HH-suite to `/usr/bin`, run
## Usage
To download DeepMind's pretrained parameters and common ground truth data, run:
```bash
scripts/download_data.sh data/
```
You have two choices for downloading protein databases, depending on whether
you want to use DeepMind's MSA generation pipeline (with HMMER & HHblits) or
[ColabFold](https://github.com/sokrypton/ColabFold)'s, which uses the faster
[MMseqs2](https://github.com/soedinglab/mmseqs2) instead. For the former, run:
```bash
scripts/download_alphafold_databases.sh data/
```
For the latter, run:
```bash
scripts/download_mmseqs_databases.sh data/ # downloads .tar files
scripts/prep_mmseqs_databases.sh data/ # unpacks and preps the databases
```
Make sure to run the latter command on the machine that will be used for MSA
generation (the script estimates how the precomputed database index used by
MMseqs2 should be split according to the memory available on the system).
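At its core, that estimate is just a ceiling division of the index size by the memory available for it. A minimal sketch (`estimate_index_splits` is a hypothetical helper for illustration; the actual logic lives in `scripts/prep_mmseqs_databases.sh`):

```python
import os

def estimate_index_splits(index_size_bytes, usable_mem_bytes):
    """Number of roughly equal chunks so each index chunk fits in memory.

    Hypothetical illustration only -- the real estimate is made by
    scripts/prep_mmseqs_databases.sh at database-prep time.
    """
    # Ceiling division; even a tiny index still needs one chunk
    return max(1, -(-index_size_bytes // usable_mem_bytes))

# On the target machine, total physical RAM can be read like this:
total_mem_bytes = os.sysconf("SC_PAGE_SIZE") * os.sysconf("SC_PHYS_PAGES")
```

For example, `estimate_index_splits(2048 * 2**30, 512 * 2**30)` returns 4: a ~2 TiB index on a machine with 512 GiB usable would be split into four chunks.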
Alternatively, you can use raw MSAs from
[ProteinNet](https://github.com/aqlaboratory/proteinnet). After downloading
the database, use `scripts/prepare_proteinnet_msas.py` to convert the data into
...@@ -74,7 +94,8 @@ a format recognized by the OpenFold parser. The resulting directory becomes the
### Inference
To run inference on a sequence using a set of DeepMind's pretrained parameters,
run e.g.:
```bash
python3 run_pretrained_openfold.py \
...@@ -93,17 +114,25 @@ python3 run_pretrained_openfold.py \
--kalign_binary_path lib/conda/envs/openfold_venv/bin/kalign
```
where `data` is the same directory as in the previous step. If `jackhmmer`,
`hhblits`, `hhsearch` and `kalign` are available at the default path of
`/usr/bin`, their `binary_path` command-line arguments can be dropped.
If you've already computed alignments for the query (see "Training"), you have
the option to circumvent the expensive alignment computation here.
### Training
After activating the OpenFold environment with
`source scripts/activate_conda_env.sh`, install OpenFold by running
```bash
python setup.py install
```
To train the model, you will first need to precompute protein alignments.
You have two options. You can use the same procedure DeepMind used by running
the following:
```bash
python3 scripts/precompute_alignments.py mmcif_dir/ alignment_dir/ \
...@@ -119,9 +148,28 @@ python3 scripts/precompute_alignments.py mmcif_dir/ alignment_dir/ \
--hhsearch_binary_path lib/conda/envs/openfold_venv/bin/hhsearch \
--kalign_binary_path lib/conda/envs/openfold_venv/bin/kalign
```
As noted before, you can skip the `binary_path` arguments if these binaries are at `/usr/bin`.
Expect this step to take a very long time, even for small numbers of proteins.
Alternatively, you can generate MSAs with the ColabFold pipeline (and templates
with HHsearch) with:
```bash
python3 scripts/precompute_alignments_mmseqs.py input.fasta \
data/mmseqs_dbs \
uniref30_2103_db \
output_dir \
~/MMseqs2/build/bin/mmseqs \
/usr/bin/hhsearch \
--env_db colabfold_envdb_202108_db \
--pdb70 data/pdb70/pdb70
```
where `input.fasta` is a FASTA file containing one or more query sequences. To
generate an input FASTA from a directory of mmCIF files, we provide
`scripts/mmcif_dir_to_fasta.py`.
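Each record in the resulting FASTA is named after the mmCIF basename (uppercased) and the chain ID, joined with an underscore. A small sketch of that naming convention (an illustrative re-implementation, not the script itself):

```python
def chains_to_fasta(file_id, chain_to_seqres):
    """Render one mmCIF entry's chains as FASTA records.

    Mirrors the convention used by scripts/mmcif_dir_to_fasta.py:
    each chain becomes ">{FILE_ID}_{CHAIN}" followed by its sequence.
    """
    lines = []
    for chain, seq in chain_to_seqres.items():
        lines.append(f">{file_id.upper()}_{chain}")
        lines.append(seq)
    return "\n".join(lines)
```

For example, `chains_to_fasta("4dsn", {"A": "MKV"})` yields `">4DSN_A\nMKV"`.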
Next, generate a cache of certain datapoints in the mmCIF files:
```bash
...
...@@ -49,7 +49,6 @@ def main(args):
import_jax_weights_(model, args.param_path)
model = model.to(args.model_device)
# FEATURE COLLECTION AND PROCESSING
template_featurizer = templates.TemplateHitFeaturizer(
mmcif_dir=args.template_mmcif_dir,
max_template_date=args.max_template_date,
...@@ -61,20 +60,6 @@ def main(args):
use_small_bfd=(args.bfd_database_path is None)
alignment_runner = data_pipeline.AlignmentRunner(
jackhmmer_binary_path=args.jackhmmer_binary_path,
hhblits_binary_path=args.hhblits_binary_path,
hhsearch_binary_path=args.hhsearch_binary_path,
uniref90_database_path=args.uniref90_database_path,
mgnify_database_path=args.mgnify_database_path,
bfd_database_path=args.bfd_database_path,
uniclust30_database_path=args.uniclust30_database_path,
small_bfd_database_path=args.small_bfd_database_path,
pdb70_database_path=args.pdb70_database_path,
use_small_bfd=use_small_bfd,
no_cpus=args.cpus,
)
data_processor = data_pipeline.DataPipeline(
template_featurizer=template_featurizer,
)
...@@ -91,9 +76,26 @@ def main(args):
os.makedirs(alignment_dir)
logging.info("Generating features...")
if(args.use_precomputed_alignments is None):
    alignment_runner = data_pipeline.AlignmentRunner(
        jackhmmer_binary_path=args.jackhmmer_binary_path,
        hhblits_binary_path=args.hhblits_binary_path,
        hhsearch_binary_path=args.hhsearch_binary_path,
        uniref90_database_path=args.uniref90_database_path,
        mgnify_database_path=args.mgnify_database_path,
        bfd_database_path=args.bfd_database_path,
        uniclust30_database_path=args.uniclust30_database_path,
        small_bfd_database_path=args.small_bfd_database_path,
        pdb70_database_path=args.pdb70_database_path,
        use_small_bfd=use_small_bfd,
        no_cpus=args.cpus,
    )
    alignment_runner.run(
        args.fasta_path, alignment_dir
    )
else:
    alignment_dir = args.use_precomputed_alignments
feature_dict = data_processor.process_fasta(
fasta_path=args.fasta_path, alignment_dir=alignment_dir
...@@ -155,6 +157,10 @@ if __name__ == "__main__":
"fasta_path", type=str,
)
add_data_args(parser)
parser.add_argument(
"--use_precomputed_alignments", type=str, default=None,
help="""Path to alignment directory. If provided, alignment computation
is skipped and database path arguments are ignored."""
)
parser.add_argument(
"--output_dir", type=str, default=os.getcwd(),
help="""Name of the directory in which to output the prediction""",
...
#!/bin/bash -e
# Copied from colabfold.mmseqs.com
MMSEQS="$1"
QUERY="$2"
DBBASE="$3"
BASE="$4"
DB1="$5"
DB2="$6"
DB3="$7"
USE_ENV="${8:-1}"
USE_TEMPLATES="${9:-0}"
FILTER="${10:-1}"
INDEX=${11:-1}
DB_LOAD_MODE="${12:-2}"
EXPAND_EVAL=inf
ALIGN_EVAL=10
DIFF=3000
QSC=-20.0
MAX_ACCEPT=1000000
if [ "${FILTER}" = "1" ]; then
# 0.1 was not used in benchmarks due to POSIX shell bug in line above
# EXPAND_EVAL=0.1
ALIGN_EVAL=10
QSC=0.8
MAX_ACCEPT=100000
fi
if [ "${INDEX}" = "1" ]; then
SEQ=".idx"
ALN=".idx"
IDX=".idx"
else
SEQ="_seq"
ALN="_aln"
IDX=""
export MMSEQS_IGNORE_INDEX=1
fi
export MMSEQS_CALL_DEPTH=1
SEARCH_PARAM="--num-iterations 3 --db-load-mode ${DB_LOAD_MODE} -a -s 8 -e 0.1 --max-seqs 10000"
FILTER_PARAM="--filter-msa ${FILTER} --filter-min-enable 1000 --diff ${DIFF} --qid 0.0,0.2,0.4,0.6,0.8,1.0 --qsc 0 --max-seq-id 0.95"
EXPAND_PARAM="--expansion-mode 0 -e ${EXPAND_EVAL} --expand-filter-clusters ${FILTER} --max-seq-id 0.95"
mkdir -p "${BASE}"
"${MMSEQS}" createdb "${QUERY}" "${BASE}/qdb"
"${MMSEQS}" search "${BASE}/qdb" "${DBBASE}/${DB1}" "${BASE}/res" "${BASE}/tmp" $SEARCH_PARAM
"${MMSEQS}" expandaln "${BASE}/qdb" "${DBBASE}/${DB1}${SEQ}" "${BASE}/res" "${DBBASE}/${DB1}${ALN}" "${BASE}/res_exp" --db-load-mode ${DB_LOAD_MODE} ${EXPAND_PARAM}
"${MMSEQS}" mvdb "${BASE}/tmp/latest/profile_1" "${BASE}/prof_res"
"${MMSEQS}" lndb "${BASE}/qdb_h" "${BASE}/prof_res_h"
"${MMSEQS}" align "${BASE}/prof_res" "${DBBASE}/${DB1}${SEQ}" "${BASE}/res_exp" "${BASE}/res_exp_realign" --db-load-mode ${DB_LOAD_MODE} -e ${ALIGN_EVAL} --max-accept ${MAX_ACCEPT} --alt-ali 10 -a
"${MMSEQS}" filterresult "${BASE}/qdb" "${DBBASE}/${DB1}${SEQ}" "${BASE}/res_exp_realign" "${BASE}/res_exp_realign_filter" --db-load-mode ${DB_LOAD_MODE} --qid 0 --qsc $QSC --diff 0 --max-seq-id 1.0 --filter-min-enable 100
"${MMSEQS}" result2msa "${BASE}/qdb" "${DBBASE}/${DB1}${SEQ}" "${BASE}/res_exp_realign_filter" "${BASE}/uniref.a3m" --msa-format-mode 6 --db-load-mode ${DB_LOAD_MODE} ${FILTER_PARAM}
"${MMSEQS}" rmdb "${BASE}/res_exp_realign"
"${MMSEQS}" rmdb "${BASE}/res_exp"
"${MMSEQS}" rmdb "${BASE}/res"
"${MMSEQS}" rmdb "${BASE}/res_exp_realign_filter"
if [ "${USE_TEMPLATES}" = "1" ]; then
"${MMSEQS}" search "${BASE}/prof_res" "${DBBASE}/${DB2}" "${BASE}/res_pdb" "${BASE}/tmp" --db-load-mode ${DB_LOAD_MODE} -s 7.5 -a -e 0.1
"${MMSEQS}" convertalis "${BASE}/prof_res" "${DBBASE}/${DB2}${IDX}" "${BASE}/res_pdb" "${BASE}/${DB2}.m8" --format-output query,target,fident,alnlen,mismatch,gapopen,qstart,qend,tstart,tend,evalue,bits,cigar --db-load-mode ${DB_LOAD_MODE}
"${MMSEQS}" rmdb "${BASE}/res_pdb"
fi
if [ "${USE_ENV}" = "1" ]; then
"${MMSEQS}" search "${BASE}/prof_res" "${DBBASE}/${DB3}" "${BASE}/res_env" "${BASE}/tmp" $SEARCH_PARAM
"${MMSEQS}" expandaln "${BASE}/prof_res" "${DBBASE}/${DB3}${SEQ}" "${BASE}/res_env" "${DBBASE}/${DB3}${ALN}" "${BASE}/res_env_exp" -e ${EXPAND_EVAL} --expansion-mode 0 --db-load-mode ${DB_LOAD_MODE}
"${MMSEQS}" align "${BASE}/tmp/latest/profile_1" "${DBBASE}/${DB3}${SEQ}" "${BASE}/res_env_exp" "${BASE}/res_env_exp_realign" --db-load-mode ${DB_LOAD_MODE} -e ${ALIGN_EVAL} --max-accept ${MAX_ACCEPT} --alt-ali 10 -a
"${MMSEQS}" filterresult "${BASE}/qdb" "${DBBASE}/${DB3}${SEQ}" "${BASE}/res_env_exp_realign" "${BASE}/res_env_exp_realign_filter" --db-load-mode ${DB_LOAD_MODE} --qid 0 --qsc $QSC --diff 0 --max-seq-id 1.0 --filter-min-enable 100
"${MMSEQS}" result2msa "${BASE}/qdb" "${DBBASE}/${DB3}${SEQ}" "${BASE}/res_env_exp_realign_filter" "${BASE}/bfd.mgnify30.metaeuk30.smag30.a3m" --msa-format-mode 6 --db-load-mode ${DB_LOAD_MODE} ${FILTER_PARAM}
"${MMSEQS}" rmdb "${BASE}/res_env_exp_realign_filter"
"${MMSEQS}" rmdb "${BASE}/res_env_exp_realign"
"${MMSEQS}" rmdb "${BASE}/res_env_exp"
"${MMSEQS}" rmdb "${BASE}/res_env"
fi
"${MMSEQS}" rmdb "${BASE}/qdb"
"${MMSEQS}" rmdb "${BASE}/qdb_h"
"${MMSEQS}" rmdb "${BASE}/res"
rm -f -- "${BASE}/prof_res"*
rm -rf -- "${BASE}/tmp"
...@@ -39,9 +39,6 @@ fi
SCRIPT_DIR="$(dirname "$(realpath "$0")")"
echo "Downloading AlphaFold parameters..."
bash "${SCRIPT_DIR}/download_alphafold_params.sh" "${DOWNLOAD_DIR}"
if [[ "${DOWNLOAD_MODE}" = full_dbs ]] ; then
    echo "Downloading BFD..."
    bash "${SCRIPT_DIR}/download_bfd.sh" "${DOWNLOAD_DIR}"
...@@ -53,12 +50,6 @@ fi
echo "Downloading MGnify..."
bash "${SCRIPT_DIR}/download_mgnify.sh" "${DOWNLOAD_DIR}"
echo "Downloading PDB70..."
bash "${SCRIPT_DIR}/download_pdb70.sh" "${DOWNLOAD_DIR}"
echo "Downloading PDB mmCIF files..."
bash "${SCRIPT_DIR}/download_pdb_mmcif.sh" "${DOWNLOAD_DIR}"
echo "Downloading Uniclust30..."
bash "${SCRIPT_DIR}/download_uniclust30.sh" "${DOWNLOAD_DIR}"
...
#!/bin/bash
#
# Copyright 2021 AlQuraishi Laboratory
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
#
# Downloads and unzips the ColabFold environmental database.
#
# Usage: bash download_colabfold_envdb.sh /path/to/download/directory
set -e
if [[ $# -eq 0 ]]; then
echo "Error: download directory must be provided as an input argument."
exit 1
fi
if ! command -v aria2c &> /dev/null ; then
echo "Error: aria2c could not be found. Please install aria2c (sudo apt install aria2)."
exit 1
fi
DOWNLOAD_DIR="$1"
ROOT_DIR="${DOWNLOAD_DIR}"
SOURCE_URL="http://wwwuser.gwdg.de/~compbiol/colabfold/colabfold_envdb_202108.tar.gz"
BASENAME=$(basename "${SOURCE_URL}")
mkdir --parents "${ROOT_DIR}"
aria2c "${SOURCE_URL}" --dir="${ROOT_DIR}" -x 4
#!/bin/bash
#
# Copyright 2021 DeepMind Technologies Limited
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
#
# Downloads and unzips all required data for AlphaFold.
#
# Usage: bash download_all_data.sh /path/to/download/directory
set -e
if [[ $# -eq 0 ]]; then
echo "Error: download directory must be provided as an input argument."
exit 1
fi
if ! command -v aria2c &> /dev/null ; then
echo "Error: aria2c could not be found. Please install aria2c (sudo apt install aria2)."
exit 1
fi
DOWNLOAD_DIR="$1"
DOWNLOAD_MODE="${2:-full_dbs}" # Default mode to full_dbs.
if [[ "${DOWNLOAD_MODE}" != full_dbs && "${DOWNLOAD_MODE}" != reduced_dbs ]]
then
echo "DOWNLOAD_MODE ${DOWNLOAD_MODE} not recognized."
exit 1
fi
SCRIPT_DIR="$(dirname "$(realpath "$0")")"
echo "Downloading AlphaFold parameters..."
bash "${SCRIPT_DIR}/download_alphafold_params.sh" "${DOWNLOAD_DIR}"
echo "Downloading PDB70..."
bash "${SCRIPT_DIR}/download_pdb70.sh" "${DOWNLOAD_DIR}"
echo "Downloading PDB mmCIF files..."
bash "${SCRIPT_DIR}/download_pdb_mmcif.sh" "${DOWNLOAD_DIR}"
echo "All data downloaded."
#!/bin/bash
#
# Copyright 2021 AlQuraishi Laboratory
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
#
# Downloads the databases used by the MMseqs2-based MSA pipeline.
#
# Usage: bash download_mmseqs_databases.sh /path/to/download/directory
set -e
if [[ $# -eq 0 ]]; then
echo "Error: download directory must be provided as an input argument."
exit 1
fi
if ! command -v aria2c &> /dev/null ; then
echo "Error: aria2c could not be found. Please install aria2c (sudo apt install aria2)."
exit 1
fi
DOWNLOAD_DIR="$1"
DOWNLOAD_MODE="${2:-full_dbs}" # Default mode to full_dbs.
if [[ "${DOWNLOAD_MODE}" != full_dbs && "${DOWNLOAD_MODE}" != reduced_dbs ]]
then
echo "DOWNLOAD_MODE ${DOWNLOAD_MODE} not recognized."
exit 1
fi
SCRIPT_DIR="$(dirname "$(realpath "$0")")"
echo "Downloading Uniref30..."
bash "${SCRIPT_DIR}/download_uniref30.sh" "${DOWNLOAD_DIR}"
echo "Downloading ColabFold's environmental database..."
bash "${SCRIPT_DIR}/download_colabfold_envdb.sh" "${DOWNLOAD_DIR}"
echo "All data downloaded."
#!/bin/bash
#
# Copyright 2021 AlQuraishi Laboratory
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
#
# Downloads and unzips the Uniref30 database for MMseqs2.
#
# Usage: bash download_uniref30.sh /path/to/download/directory
set -e
if [[ $# -eq 0 ]]; then
echo "Error: download directory must be provided as an input argument."
exit 1
fi
if ! command -v aria2c &> /dev/null ; then
echo "Error: aria2c could not be found. Please install aria2c (sudo apt install aria2)."
exit 1
fi
DOWNLOAD_DIR="$1"
ROOT_DIR="${DOWNLOAD_DIR}"
SOURCE_URL="http://wwwuser.gwdg.de/~compbiol/colabfold/uniref30_2103.tar.gz"
BASENAME=$(basename "${SOURCE_URL}")
mkdir --parents "${ROOT_DIR}"
aria2c "${SOURCE_URL}" --dir="${ROOT_DIR}" -x 4
import argparse
import logging
import os
from openfold.data import mmcif_parsing
def main(args):
fasta = []
for fname in os.listdir(args.mmcif_dir):
basename, ext = os.path.splitext(fname)
basename = basename.upper()
if(not ext == ".cif"):
continue
fpath = os.path.join(args.mmcif_dir, fname)
with open(fpath, 'r') as fp:
mmcif_str = fp.read()
mmcif = mmcif_parsing.parse(
file_id=basename, mmcif_string=mmcif_str
)
if(mmcif.mmcif_object is None):
logging.warning(f'Failed to parse {fname}...')
if(args.raise_errors):
raise list(mmcif.errors.values())[0]
else:
continue
mmcif = mmcif.mmcif_object
for chain, seq in mmcif.chain_to_seqres.items():
chain_id = '_'.join([basename, chain])
fasta.append(f">{chain_id}")
fasta.append(seq)
with open(args.output_path, "w") as fp:
fp.write('\n'.join(fasta))
if __name__ == "__main__":
parser = argparse.ArgumentParser()
parser.add_argument(
"mmcif_dir", type=str,
help="Path to a directory containing mmCIF files"
)
parser.add_argument(
"output_path", type=str,
help="Path to output FASTA file"
)
parser.add_argument(
"--raise_errors", action="store_true", default=False,
help="Whether to crash on parsing errors"
)
args = parser.parse_args()
main(args)
import argparse
import logging
import os
from pathlib import Path
import subprocess
from openfold.data.tools import hhsearch
def _split_a3ms(output_dir):
for fname in os.listdir(output_dir):
if(not os.path.splitext(fname)[-1] == ".a3m"):
continue
fpath = os.path.join(output_dir, fname)
with open(fpath, "r") as fp:
a3ms = fp.read()
# Split by the null byte, excluding the terminating null byte
a3ms = a3ms.split('\x00')[:-1]
for a3m in a3ms:
name = a3m.split('\n', 1)[0][1:]
prot_dir = os.path.join(output_dir, name)
Path(prot_dir).mkdir(parents=True, exist_ok=True)
with open(os.path.join(prot_dir, fname), "w") as fp:
fp.write(a3m)
os.remove(fpath)
os.remove(fpath + ".dbtype")
os.remove(fpath + ".index")
def main(args):
with open(args.input_fasta, "r") as f:
lines = [l.strip() for l in f.readlines()]
names = lines[::2]
seqs = lines[1::2]
if(args.fasta_chunk_size is None):
chunk_size = len(seqs)
else:
chunk_size = args.fasta_chunk_size
# Make the output directory
Path(args.output_dir).mkdir(parents=True, exist_ok=True)
s = 0
while(s < len(seqs)):
e = s + chunk_size
chunk_fasta = [el for tup in zip(names[s:e], seqs[s:e]) for el in tup]
s = e
# Strip the leading ">" from the first header to get the protein name
prot_dir = os.path.join(args.output_dir, chunk_fasta[0][1:].upper())
if(os.path.exists(prot_dir)):
# We've already computed this chunk
continue
chunk_fasta_path = os.path.join(args.output_dir, "tmp.fasta")
with open(chunk_fasta_path, "w") as f:
f.write('\n'.join(chunk_fasta) + '\n')
cmd = [
"scripts/colabfold_search.sh",
args.mmseqs_binary_path,
chunk_fasta_path,
args.mmseqs_db_dir,
args.output_dir,
args.uniref_db,
'""',
'""' if args.env_db is None else args.env_db,
"0" if args.env_db is None else "1",
"0", # compute templates
"1", # filter
"1", # use precomputed index
"0", # db-load-mode
]
logging.info('Launching subprocess "%s"', " ".join(cmd))
process = subprocess.Popen(
cmd, stdout=subprocess.PIPE, stderr=subprocess.PIPE
)
stdout, stderr = process.communicate()
retcode = process.wait()
if retcode:
raise RuntimeError(
"MMseqs failed\nstdout:\n%s\n\nstderr:\n%s\n"
% (stdout.decode("utf-8"), stderr.decode("utf-8"))
)
_split_a3ms(args.output_dir)
# Clean up temporary files
os.remove(chunk_fasta_path)
hhsearch_pdb70_runner = hhsearch.HHSearch(
binary_path=args.hhsearch_binary_path, databases=[args.pdb70]
)
for d in os.listdir(args.output_dir):
dpath = os.path.join(args.output_dir, d)
if(not os.path.isdir(dpath)):
continue
for fname in os.listdir(dpath):
fpath = os.path.join(dpath, fname)
if(not "uniref" in fname or
not os.path.splitext(fname)[-1] == ".a3m"):
continue
print(fpath)
with open(fpath, "r") as fp:
a3m = fp.read()
hhsearch_result = hhsearch_pdb70_runner.query(a3m)
pdb70_out_path = os.path.join(dpath, "pdb70_hits.hhr")
with open(pdb70_out_path, "w") as f:
f.write(hhsearch_result)
if __name__ == "__main__":
parser = argparse.ArgumentParser()
parser.add_argument(
"input_fasta", type=str,
help="Path to input FASTA file. Can contain one or more sequences."
)
parser.add_argument(
"mmseqs_db_dir", type=str,
help="""Path to directory containing pre-processed MMSeqs2 DBs
(see README)"""
)
parser.add_argument(
"uniref_db", type=str,
help="Basename of uniref database"
)
parser.add_argument(
"output_dir", type=str,
help="Output directory"
)
parser.add_argument(
"mmseqs_binary_path", type=str,
help="Path to mmseqs binary"
)
parser.add_argument(
"--hhsearch_binary_path", type=str, default=None,
help="""Path to hhsearch binary (for template search). In future
versions, we'll also use mmseqs for this"""
)
parser.add_argument(
"--pdb70", type=str, default=None,
help="Basename of the pdb70 database"
)
parser.add_argument(
"--env_db", type=str, default=None,
help="Basename of environmental database"
)
parser.add_argument(
"--fasta_chunk_size", type=int, default=None,
help="""How many sequences should be processed at once. All sequences
processed at once by default."""
)
args = parser.parse_args()
if(args.hhsearch_binary_path is not None and args.pdb70 is None):
raise ValueError(
"pdb70 must be specified along with hhsearch_binary_path"
)
main(args)
#!/bin/bash
#
# Copyright 2021 AlQuraishi Laboratory
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
#
# Unpacks and prepares the downloaded MMseqs2 databases.
#
# Usage: bash prep_mmseqs_databases.sh /path/to/download/directory
set -e
DOWNLOAD_DIR="$1"
mkdir --parents "${DOWNLOAD_DIR}/mmseqs_dbs"
for f in "${DOWNLOAD_DIR}"/*.tar.gz
do
    tar --extract --verbose --file="${f}" \
        --directory="${DOWNLOAD_DIR}/mmseqs_dbs"
    rm "${f}"
    BASENAME="$(basename "${f}")"
    BASENAME="${BASENAME%%.*}"
    DB_NAME="${BASENAME}_db"
    OLD_PWD=$(pwd)
    cd "${DOWNLOAD_DIR}/mmseqs_dbs"
    mmseqs tar2exprofiledb "${BASENAME}" "${DB_NAME}"
    mmseqs createindex "${DB_NAME}" "${DOWNLOAD_DIR}/tmp/"
    cd "${OLD_PWD}"
done
...@@ -2,7 +2,7 @@ import argparse
import logging
import os
#os.environ["CUDA_VISIBLE_DEVICES"] = "6"
#os.environ["MASTER_ADDR"]="10.119.81.14"
#os.environ["MASTER_PORT"]="42069"
#os.environ["NODE_RANK"]="0"
...@@ -114,6 +114,7 @@ def main(args):
train=True,
low_prec=(args.precision == 16)
)
model_module = OpenFoldWrapper(config)
if(args.resume_from_ckpt and args.resume_model_weights_only):
sd = get_fp32_state_dict_from_zero_checkpoint(args.resume_from_ckpt)
...@@ -126,6 +127,7 @@ def main(args):
batch_seed=args.seed,
**vars(args)
)
data_module.prepare_data()
data_module.setup()
...