Commit 36f317be authored by Gustaf

Add MMseqs2 MSA generation

parent 99361481
...@@ -60,12 +60,32 @@ To install the HH-suite to `/usr/bin`, run
## Usage
To download DeepMind's pretrained parameters and common ground truth data, run:
```bash
scripts/download_data.sh data/
```
You have two choices for downloading protein databases, depending on whether
you want to use DeepMind's MSA generation pipeline (with HMMER & HHblits) or
[ColabFold](https://github.com/sokrypton/ColabFold)'s, which uses the faster
[MMseqs2](https://github.com/soedinglab/mmseqs2) instead. For the former, run:
```bash
scripts/download_alphafold_databases.sh data/
```
For the latter, run:
```bash
scripts/download_mmseqs_databases.sh data/ # downloads .tar files
scripts/prep_mmseqs_databases.sh data/ # unpacks and preps the databases
```
Make sure to run the latter command on the machine that will be used for MSA
generation (the script estimates how the precomputed database index used by
MMseqs2 should be split according to the memory available on the system).
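At its core, that estimate is just a ceiling division of the index size by the memory available for it. A minimal sketch (`estimate_index_splits` is a hypothetical helper for illustration; the actual logic lives in `scripts/prep_mmseqs_databases.sh`):

```python
import os

def estimate_index_splits(index_size_bytes, usable_mem_bytes):
    """Number of roughly equal chunks so each index chunk fits in memory.

    Hypothetical illustration only -- the real estimate is made by
    scripts/prep_mmseqs_databases.sh at database-prep time.
    """
    # Ceiling division; even a tiny index still needs one chunk
    return max(1, -(-index_size_bytes // usable_mem_bytes))

# On the target machine, total physical RAM can be read like this:
total_mem_bytes = os.sysconf("SC_PAGE_SIZE") * os.sysconf("SC_PHYS_PAGES")
```

For example, `estimate_index_splits(2048 * 2**30, 512 * 2**30)` returns 4: a ~2 TiB index on a machine with 512 GiB usable would be split into four chunks.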
Alternatively, you can use raw MSAs from
[ProteinNet](https://github.com/aqlaboratory/proteinnet). After downloading
the database, use `scripts/prepare_proteinnet_msas.py` to convert the data into
...@@ -74,7 +94,8 @@ a format recognized by the OpenFold parser. The resulting directory becomes the
### Inference
To run inference on a sequence using a set of DeepMind's pretrained parameters,
run e.g.:
```bash
python3 run_pretrained_openfold.py \
...@@ -93,17 +114,25 @@ python3 run_pretrained_openfold.py \
--kalign_binary_path lib/conda/envs/openfold_venv/bin/kalign
```
where `data` is the same directory as in the previous step. If `jackhmmer`,
`hhblits`, `hhsearch` and `kalign` are available at the default path of
`/usr/bin`, their `binary_path` command-line arguments can be dropped.
If you've already computed alignments for the query (see "Training"), you have
the option to circumvent the expensive alignment computation here.
### Training
After activating the OpenFold environment with
`source scripts/activate_conda_env.sh`, install OpenFold by running
```bash
python setup.py install
```
To train the model, you will first need to precompute protein alignments.
You have two options. You can use the same procedure DeepMind used by running
the following:
```bash
python3 scripts/precompute_alignments.py mmcif_dir/ alignment_dir/ \
...@@ -119,9 +148,28 @@ python3 scripts/precompute_alignments.py mmcif_dir/ alignment_dir/ \
--hhsearch_binary_path lib/conda/envs/openfold_venv/bin/hhsearch \
--kalign_binary_path lib/conda/envs/openfold_venv/bin/kalign
```
As noted before, you can skip the `binary_path` arguments if these binaries are at `/usr/bin`.
Expect this step to take a very long time, even for small numbers of proteins.
Alternatively, you can generate MSAs with the ColabFold pipeline (and templates
with HHsearch) with:
```bash
python3 scripts/precompute_alignments_mmseqs.py input.fasta \
data/mmseqs_dbs \
uniref30_2103_db \
output_dir \
~/MMseqs2/build/bin/mmseqs \
/usr/bin/hhsearch \
--env_db colabfold_envdb_202108_db \
--pdb70 data/pdb70/pdb70
```
where `input.fasta` is a FASTA file containing one or more query sequences. To
generate an input FASTA from a directory of mmCIF files, we provide
`scripts/mmcif_dir_to_fasta.py`.
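Each record in the resulting FASTA is named after the mmCIF basename (uppercased) and the chain ID, joined with an underscore. A small sketch of that naming convention (an illustrative re-implementation, not the script itself):

```python
def chains_to_fasta(file_id, chain_to_seqres):
    """Render one mmCIF entry's chains as FASTA records.

    Mirrors the convention used by scripts/mmcif_dir_to_fasta.py:
    each chain becomes ">{FILE_ID}_{CHAIN}" followed by its sequence.
    """
    lines = []
    for chain, seq in chain_to_seqres.items():
        lines.append(f">{file_id.upper()}_{chain}")
        lines.append(seq)
    return "\n".join(lines)
```

For example, `chains_to_fasta("4dsn", {"A": "MKV"})` yields `">4DSN_A\nMKV"`.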
Next, generate a cache of certain datapoints in the mmCIF files:
```bash
...
...@@ -49,7 +49,6 @@ def main(args):
import_jax_weights_(model, args.param_path)
model = model.to(args.model_device)
# FEATURE COLLECTION AND PROCESSING
template_featurizer = templates.TemplateHitFeaturizer(
mmcif_dir=args.template_mmcif_dir,
max_template_date=args.max_template_date,
...@@ -61,20 +60,6 @@ def main(args):
use_small_bfd=(args.bfd_database_path is None)
alignment_runner = data_pipeline.AlignmentRunner(
jackhmmer_binary_path=args.jackhmmer_binary_path,
hhblits_binary_path=args.hhblits_binary_path,
hhsearch_binary_path=args.hhsearch_binary_path,
uniref90_database_path=args.uniref90_database_path,
mgnify_database_path=args.mgnify_database_path,
bfd_database_path=args.bfd_database_path,
uniclust30_database_path=args.uniclust30_database_path,
small_bfd_database_path=args.small_bfd_database_path,
pdb70_database_path=args.pdb70_database_path,
use_small_bfd=use_small_bfd,
no_cpus=args.cpus,
)
data_processor = data_pipeline.DataPipeline(
template_featurizer=template_featurizer,
)
...@@ -91,9 +76,26 @@ def main(args):
os.makedirs(alignment_dir)
logging.info("Generating features...")
if(args.use_precomputed_alignments is None):
    alignment_runner = data_pipeline.AlignmentRunner(
        jackhmmer_binary_path=args.jackhmmer_binary_path,
        hhblits_binary_path=args.hhblits_binary_path,
        hhsearch_binary_path=args.hhsearch_binary_path,
        uniref90_database_path=args.uniref90_database_path,
        mgnify_database_path=args.mgnify_database_path,
        bfd_database_path=args.bfd_database_path,
        uniclust30_database_path=args.uniclust30_database_path,
        small_bfd_database_path=args.small_bfd_database_path,
        pdb70_database_path=args.pdb70_database_path,
        use_small_bfd=use_small_bfd,
        no_cpus=args.cpus,
    )
    alignment_runner.run(
        args.fasta_path, alignment_dir
    )
else:
    alignment_dir = args.use_precomputed_alignments
feature_dict = data_processor.process_fasta(
fasta_path=args.fasta_path, alignment_dir=alignment_dir
...@@ -155,6 +157,10 @@ if __name__ == "__main__":
"fasta_path", type=str,
)
add_data_args(parser)
parser.add_argument(
"--use_precomputed_alignments", type=str, default=None,
help="""Path to alignment directory. If provided, alignment computation
is skipped and database path arguments are ignored."""
)
parser.add_argument(
"--output_dir", type=str, default=os.getcwd(),
help="""Name of the directory in which to output the prediction""",
...
#!/bin/bash -e
# Copied from colabfold.mmseqs.com
MMSEQS="$1"
QUERY="$2"
DBBASE="$3"
BASE="$4"
DB1="$5"
DB2="$6"
DB3="$7"
USE_ENV="${8:-1}"
USE_TEMPLATES="${9:-0}"
FILTER="${10:-1}"
INDEX=${11:-1}
DB_LOAD_MODE="${12:-2}"
EXPAND_EVAL=inf
ALIGN_EVAL=10
DIFF=3000
QSC=-20.0
MAX_ACCEPT=1000000
if [ "${FILTER}" = "1" ]; then
# 0.1 was not used in benchmarks due to POSIX shell bug in line above
# EXPAND_EVAL=0.1
ALIGN_EVAL=10
QSC=0.8
MAX_ACCEPT=100000
fi
if [ "${INDEX}" = "1" ]; then
SEQ=".idx"
ALN=".idx"
IDX=".idx"
else
SEQ="_seq"
ALN="_aln"
IDX=""
export MMSEQS_IGNORE_INDEX=1
fi
export MMSEQS_CALL_DEPTH=1
SEARCH_PARAM="--num-iterations 3 --db-load-mode ${DB_LOAD_MODE} -a -s 8 -e 0.1 --max-seqs 10000"
FILTER_PARAM="--filter-msa ${FILTER} --filter-min-enable 1000 --diff ${DIFF} --qid 0.0,0.2,0.4,0.6,0.8,1.0 --qsc 0 --max-seq-id 0.95"
EXPAND_PARAM="--expansion-mode 0 -e ${EXPAND_EVAL} --expand-filter-clusters ${FILTER} --max-seq-id 0.95"
mkdir -p "${BASE}"
"${MMSEQS}" createdb "${QUERY}" "${BASE}/qdb"
"${MMSEQS}" search "${BASE}/qdb" "${DBBASE}/${DB1}" "${BASE}/res" "${BASE}/tmp" $SEARCH_PARAM
"${MMSEQS}" expandaln "${BASE}/qdb" "${DBBASE}/${DB1}${SEQ}" "${BASE}/res" "${DBBASE}/${DB1}${ALN}" "${BASE}/res_exp" --db-load-mode ${DB_LOAD_MODE} ${EXPAND_PARAM}
"${MMSEQS}" mvdb "${BASE}/tmp/latest/profile_1" "${BASE}/prof_res"
"${MMSEQS}" lndb "${BASE}/qdb_h" "${BASE}/prof_res_h"
"${MMSEQS}" align "${BASE}/prof_res" "${DBBASE}/${DB1}${SEQ}" "${BASE}/res_exp" "${BASE}/res_exp_realign" --db-load-mode ${DB_LOAD_MODE} -e ${ALIGN_EVAL} --max-accept ${MAX_ACCEPT} --alt-ali 10 -a
"${MMSEQS}" filterresult "${BASE}/qdb" "${DBBASE}/${DB1}${SEQ}" "${BASE}/res_exp_realign" "${BASE}/res_exp_realign_filter" --db-load-mode ${DB_LOAD_MODE} --qid 0 --qsc $QSC --diff 0 --max-seq-id 1.0 --filter-min-enable 100
"${MMSEQS}" result2msa "${BASE}/qdb" "${DBBASE}/${DB1}${SEQ}" "${BASE}/res_exp_realign_filter" "${BASE}/uniref.a3m" --msa-format-mode 6 --db-load-mode ${DB_LOAD_MODE} ${FILTER_PARAM}
"${MMSEQS}" rmdb "${BASE}/res_exp_realign"
"${MMSEQS}" rmdb "${BASE}/res_exp"
"${MMSEQS}" rmdb "${BASE}/res"
"${MMSEQS}" rmdb "${BASE}/res_exp_realign_filter"
if [ "${USE_TEMPLATES}" = "1" ]; then
"${MMSEQS}" search "${BASE}/prof_res" "${DBBASE}/${DB2}" "${BASE}/res_pdb" "${BASE}/tmp" --db-load-mode ${DB_LOAD_MODE} -s 7.5 -a -e 0.1
"${MMSEQS}" convertalis "${BASE}/prof_res" "${DBBASE}/${DB2}${IDX}" "${BASE}/res_pdb" "${BASE}/${DB2}.m8" --format-output query,target,fident,alnlen,mismatch,gapopen,qstart,qend,tstart,tend,evalue,bits,cigar --db-load-mode ${DB_LOAD_MODE}
"${MMSEQS}" rmdb "${BASE}/res_pdb"
fi
if [ "${USE_ENV}" = "1" ]; then
"${MMSEQS}" search "${BASE}/prof_res" "${DBBASE}/${DB3}" "${BASE}/res_env" "${BASE}/tmp" $SEARCH_PARAM
"${MMSEQS}" expandaln "${BASE}/prof_res" "${DBBASE}/${DB3}${SEQ}" "${BASE}/res_env" "${DBBASE}/${DB3}${ALN}" "${BASE}/res_env_exp" -e ${EXPAND_EVAL} --expansion-mode 0 --db-load-mode ${DB_LOAD_MODE}
"${MMSEQS}" align "${BASE}/tmp/latest/profile_1" "${DBBASE}/${DB3}${SEQ}" "${BASE}/res_env_exp" "${BASE}/res_env_exp_realign" --db-load-mode ${DB_LOAD_MODE} -e ${ALIGN_EVAL} --max-accept ${MAX_ACCEPT} --alt-ali 10 -a
"${MMSEQS}" filterresult "${BASE}/qdb" "${DBBASE}/${DB3}${SEQ}" "${BASE}/res_env_exp_realign" "${BASE}/res_env_exp_realign_filter" --db-load-mode ${DB_LOAD_MODE} --qid 0 --qsc $QSC --diff 0 --max-seq-id 1.0 --filter-min-enable 100
"${MMSEQS}" result2msa "${BASE}/qdb" "${DBBASE}/${DB3}${SEQ}" "${BASE}/res_env_exp_realign_filter" "${BASE}/bfd.mgnify30.metaeuk30.smag30.a3m" --msa-format-mode 6 --db-load-mode ${DB_LOAD_MODE} ${FILTER_PARAM}
"${MMSEQS}" rmdb "${BASE}/res_env_exp_realign_filter"
"${MMSEQS}" rmdb "${BASE}/res_env_exp_realign"
"${MMSEQS}" rmdb "${BASE}/res_env_exp"
"${MMSEQS}" rmdb "${BASE}/res_env"
fi
"${MMSEQS}" rmdb "${BASE}/qdb"
"${MMSEQS}" rmdb "${BASE}/qdb_h"
"${MMSEQS}" rmdb "${BASE}/res"
rm -f -- "${BASE}/prof_res"*
rm -rf -- "${BASE}/tmp"
...@@ -39,9 +39,6 @@ fi
SCRIPT_DIR="$(dirname "$(realpath "$0")")"
echo "Downloading AlphaFold parameters..."
bash "${SCRIPT_DIR}/download_alphafold_params.sh" "${DOWNLOAD_DIR}"
if [[ "${DOWNLOAD_MODE}" = full_dbs ]] ; then
    echo "Downloading BFD..."
    bash "${SCRIPT_DIR}/download_bfd.sh" "${DOWNLOAD_DIR}"
...@@ -53,12 +50,6 @@ fi
echo "Downloading MGnify..."
bash "${SCRIPT_DIR}/download_mgnify.sh" "${DOWNLOAD_DIR}"
echo "Downloading PDB70..."
bash "${SCRIPT_DIR}/download_pdb70.sh" "${DOWNLOAD_DIR}"
echo "Downloading PDB mmCIF files..."
bash "${SCRIPT_DIR}/download_pdb_mmcif.sh" "${DOWNLOAD_DIR}"
echo "Downloading Uniclust30..."
bash "${SCRIPT_DIR}/download_uniclust30.sh" "${DOWNLOAD_DIR}"
...
#!/bin/bash
#
# Copyright 2021 AlQuraishi Laboratory
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
#
# Downloads and unzips the ColabFold environmental database.
#
# Usage: bash download_colabfold_envdb.sh /path/to/download/directory
set -e
if [[ $# -eq 0 ]]; then
echo "Error: download directory must be provided as an input argument."
exit 1
fi
if ! command -v aria2c &> /dev/null ; then
echo "Error: aria2c could not be found. Please install aria2c (sudo apt install aria2)."
exit 1
fi
DOWNLOAD_DIR="$1"
ROOT_DIR="${DOWNLOAD_DIR}"
SOURCE_URL="http://wwwuser.gwdg.de/~compbiol/colabfold/colabfold_envdb_202108.tar.gz"
BASENAME=$(basename "${SOURCE_URL}")
mkdir --parents "${ROOT_DIR}"
aria2c "${SOURCE_URL}" --dir="${ROOT_DIR}" -x 4
#!/bin/bash
#
# Copyright 2021 DeepMind Technologies Limited
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
#
# Downloads and unzips all required data for AlphaFold.
#
# Usage: bash download_all_data.sh /path/to/download/directory
set -e
if [[ $# -eq 0 ]]; then
echo "Error: download directory must be provided as an input argument."
exit 1
fi
if ! command -v aria2c &> /dev/null ; then
echo "Error: aria2c could not be found. Please install aria2c (sudo apt install aria2)."
exit 1
fi
DOWNLOAD_DIR="$1"
DOWNLOAD_MODE="${2:-full_dbs}" # Default mode to full_dbs.
if [[ "${DOWNLOAD_MODE}" != full_dbs && "${DOWNLOAD_MODE}" != reduced_dbs ]]
then
echo "DOWNLOAD_MODE ${DOWNLOAD_MODE} not recognized."
exit 1
fi
SCRIPT_DIR="$(dirname "$(realpath "$0")")"
echo "Downloading AlphaFold parameters..."
bash "${SCRIPT_DIR}/download_alphafold_params.sh" "${DOWNLOAD_DIR}"
echo "Downloading PDB70..."
bash "${SCRIPT_DIR}/download_pdb70.sh" "${DOWNLOAD_DIR}"
echo "Downloading PDB mmCIF files..."
bash "${SCRIPT_DIR}/download_pdb_mmcif.sh" "${DOWNLOAD_DIR}"
echo "All data downloaded."
#!/bin/bash
#
# Copyright 2021 AlQuraishi Laboratory
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
#
# Downloads the databases used by the MMseqs2-based MSA pipeline.
#
# Usage: bash download_mmseqs_databases.sh /path/to/download/directory
set -e
if [[ $# -eq 0 ]]; then
echo "Error: download directory must be provided as an input argument."
exit 1
fi
if ! command -v aria2c &> /dev/null ; then
echo "Error: aria2c could not be found. Please install aria2c (sudo apt install aria2)."
exit 1
fi
DOWNLOAD_DIR="$1"
DOWNLOAD_MODE="${2:-full_dbs}" # Default mode to full_dbs.
if [[ "${DOWNLOAD_MODE}" != full_dbs && "${DOWNLOAD_MODE}" != reduced_dbs ]]
then
echo "DOWNLOAD_MODE ${DOWNLOAD_MODE} not recognized."
exit 1
fi
SCRIPT_DIR="$(dirname "$(realpath "$0")")"
echo "Downloading Uniref30..."
bash "${SCRIPT_DIR}/download_uniref30.sh" "${DOWNLOAD_DIR}"
echo "Downloading ColabFold's environmental database..."
bash "${SCRIPT_DIR}/download_colabfold_envdb.sh" "${DOWNLOAD_DIR}"
echo "All data downloaded."
#!/bin/bash
#
# Copyright 2021 AlQuraishi Laboratory
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
#
# Downloads and unzips the Uniref30 database for MMseqs2.
#
# Usage: bash download_uniref30.sh /path/to/download/directory
set -e
if [[ $# -eq 0 ]]; then
echo "Error: download directory must be provided as an input argument."
exit 1
fi
if ! command -v aria2c &> /dev/null ; then
echo "Error: aria2c could not be found. Please install aria2c (sudo apt install aria2)."
exit 1
fi
DOWNLOAD_DIR="$1"
ROOT_DIR="${DOWNLOAD_DIR}"
SOURCE_URL="http://wwwuser.gwdg.de/~compbiol/colabfold/uniref30_2103.tar.gz"
BASENAME=$(basename "${SOURCE_URL}")
mkdir --parents "${ROOT_DIR}"
aria2c "${SOURCE_URL}" --dir="${ROOT_DIR}" -x 4
import argparse
import logging
import os
from openfold.data import mmcif_parsing
def main(args):
fasta = []
for fname in os.listdir(args.mmcif_dir):
basename, ext = os.path.splitext(fname)
basename = basename.upper()
if(not ext == ".cif"):
continue
fpath = os.path.join(args.mmcif_dir, fname)
with open(fpath, 'r') as fp:
mmcif_str = fp.read()
mmcif = mmcif_parsing.parse(
file_id=basename, mmcif_string=mmcif_str
)
if(mmcif.mmcif_object is None):
logging.warning(f'Failed to parse {fname}...')
if(args.raise_errors):
raise list(mmcif.errors.values())[0]
else:
continue
mmcif = mmcif.mmcif_object
for chain, seq in mmcif.chain_to_seqres.items():
chain_id = '_'.join([basename, chain])
fasta.append(f">{chain_id}")
fasta.append(seq)
with open(args.output_path, "w") as fp:
fp.write('\n'.join(fasta))
if __name__ == "__main__":
parser = argparse.ArgumentParser()
parser.add_argument(
"mmcif_dir", type=str,
help="Path to a directory containing mmCIF files"
)
parser.add_argument(
"output_path", type=str,
help="Path to output FASTA file"
)
parser.add_argument(
"--raise_errors", action="store_true", default=False,
help="Whether to crash on parsing errors"
)
args = parser.parse_args()
main(args)
import argparse
import logging
import os
from pathlib import Path
import subprocess
from openfold.data.tools import hhsearch
def _split_a3ms(output_dir):
for fname in os.listdir(output_dir):
if(not os.path.splitext(fname)[-1] == ".a3m"):
continue
fpath = os.path.join(output_dir, fname)
with open(fpath, "r") as fp:
a3ms = fp.read()
# Split by the null byte, excluding the terminating null byte
a3ms = a3ms.split('\x00')[:-1]
for a3m in a3ms:
name = a3m.split('\n', 1)[0][1:]
prot_dir = os.path.join(output_dir, name)
Path(prot_dir).mkdir(parents=True, exist_ok=True)
with open(os.path.join(prot_dir, fname), "w") as fp:
fp.write(a3m)
os.remove(fpath)
os.remove(fpath + ".dbtype")
os.remove(fpath + ".index")
def main(args):
with open(args.input_fasta, "r") as f:
lines = [l.strip() for l in f.readlines()]
names = lines[::2]
seqs = lines[1::2]
if(args.fasta_chunk_size is None):
chunk_size = len(seqs)
else:
chunk_size = args.fasta_chunk_size
# Make the output directory
Path(args.output_dir).mkdir(parents=True, exist_ok=True)
s = 0
while(s < len(seqs)):
e = s + chunk_size
chunk_fasta = [el for tup in zip(names[s:e], seqs[s:e]) for el in tup]
s = e
# Strip the leading ">" from the first header to get the protein name
prot_dir = os.path.join(args.output_dir, chunk_fasta[0][1:].upper())
if(os.path.exists(prot_dir)):
# We've already computed this chunk
continue
chunk_fasta_path = os.path.join(args.output_dir, "tmp.fasta")
with open(chunk_fasta_path, "w") as f:
f.write('\n'.join(chunk_fasta) + '\n')
cmd = [
"scripts/colabfold_search.sh",
args.mmseqs_binary_path,
chunk_fasta_path,
args.mmseqs_db_dir,
args.output_dir,
args.uniref_db,
'""',
'""' if args.env_db is None else args.env_db,
"0" if args.env_db is None else "1",
"0", # compute templates
"1", # filter
"1", # use precomputed index
"0", # db-load-mode
]
logging.info('Launching subprocess "%s"', " ".join(cmd))
process = subprocess.Popen(
cmd, stdout=subprocess.PIPE, stderr=subprocess.PIPE
)
stdout, stderr = process.communicate()
retcode = process.wait()
if retcode:
raise RuntimeError(
"MMseqs failed\nstdout:\n%s\n\nstderr:\n%s\n"
% (stdout.decode("utf-8"), stderr.decode("utf-8"))
)
_split_a3ms(args.output_dir)
# Clean up temporary files
os.remove(chunk_fasta_path)
hhsearch_pdb70_runner = hhsearch.HHSearch(
binary_path=args.hhsearch_binary_path, databases=[args.pdb70]
)
for d in os.listdir(args.output_dir):
dpath = os.path.join(args.output_dir, d)
if(not os.path.isdir(dpath)):
continue
for fname in os.listdir(dpath):
fpath = os.path.join(dpath, fname)
if(not "uniref" in fname or
not os.path.splitext(fname)[-1] == ".a3m"):
continue
print(fpath)
with open(fpath, "r") as fp:
a3m = fp.read()
hhsearch_result = hhsearch_pdb70_runner.query(a3m)
pdb70_out_path = os.path.join(dpath, "pdb70_hits.hhr")
with open(pdb70_out_path, "w") as f:
f.write(hhsearch_result)
if __name__ == "__main__":
parser = argparse.ArgumentParser()
parser.add_argument(
"input_fasta", type=str,
help="Path to input FASTA file. Can contain one or more sequences."
)
parser.add_argument(
"mmseqs_db_dir", type=str,
help="""Path to directory containing pre-processed MMSeqs2 DBs
(see README)"""
)
parser.add_argument(
"uniref_db", type=str,
help="Basename of uniref database"
)
parser.add_argument(
"output_dir", type=str,
help="Output directory"
)
parser.add_argument(
"mmseqs_binary_path", type=str,
help="Path to mmseqs binary"
)
parser.add_argument(
"--hhsearch_binary_path", type=str, default=None,
help="""Path to hhsearch binary (for template search). In future
versions, we'll also use mmseqs for this"""
)
parser.add_argument(
"--pdb70", type=str, default=None,
help="Basename of the pdb70 database"
)
parser.add_argument(
"--env_db", type=str, default=None,
help="Basename of environmental database"
)
parser.add_argument(
"--fasta_chunk_size", type=int, default=None,
help="""How many sequences should be processed at once. All sequences
processed at once by default."""
)
args = parser.parse_args()
if(args.hhsearch_binary_path is not None and args.pdb70 is None):
raise ValueError(
"pdb70 must be specified along with hhsearch_binary_path"
)
main(args)
#!/bin/bash
#
# Copyright 2021 AlQuraishi Laboratory
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
#
# Unpacks and prepares the downloaded MMseqs2 databases.
#
# Usage: bash prep_mmseqs_databases.sh /path/to/download/directory
set -e
DOWNLOAD_DIR="$1"
mkdir --parents "${DOWNLOAD_DIR}/mmseqs_dbs"
for f in "${DOWNLOAD_DIR}"/*.tar.gz
do
    tar --extract --verbose --file="${f}" \
        --directory="${DOWNLOAD_DIR}/mmseqs_dbs"
    rm "${f}"
    BASENAME="$(basename "${f}")"
    BASENAME="${BASENAME%%.*}"
    DB_NAME="${BASENAME}_db"
    OLD_PWD=$(pwd)
    cd "${DOWNLOAD_DIR}/mmseqs_dbs"
    mmseqs tar2exprofiledb "${BASENAME}" "${DB_NAME}"
    mmseqs createindex "${DB_NAME}" "${DOWNLOAD_DIR}/tmp/"
    cd "${OLD_PWD}"
done
...@@ -2,7 +2,7 @@ import argparse
import logging
import os
#os.environ["CUDA_VISIBLE_DEVICES"] = "6"
#os.environ["MASTER_ADDR"]="10.119.81.14"
#os.environ["MASTER_PORT"]="42069"
#os.environ["NODE_RANK"]="0"
...@@ -114,6 +114,7 @@ def main(args):
train=True,
low_prec=(args.precision == 16)
)
model_module = OpenFoldWrapper(config)
if(args.resume_from_ckpt and args.resume_model_weights_only):
sd = get_fp32_state_dict_from_zero_checkpoint(args.resume_from_ckpt)
...@@ -126,6 +127,7 @@ def main(args):
batch_seed=args.seed,
**vars(args)
)
data_module.prepare_data()
data_module.setup()
...