Commit 2c7627f3 authored by Gustaf Ahdritz

Improve RODA download process

parent ceef010a
@@ -82,14 +82,9 @@ To install the HH-suite to `/usr/bin`, run
 
 ## Usage
 
-To download the databases used to train OpenFold and AlphaFold run:
-
-```bash
-bash scripts/download_data.sh data/
-```
-
-You have two choices for downloading protein databases, depending on whether
-you want to use DeepMind's MSA generation pipeline (w/ HMMR & HHblits) or
+If you intend to generate your own alignments, e.g. for inference, you have two
+choices for downloading protein databases, depending on whether you want to use
+DeepMind's MSA generation pipeline (w/ HMMR & HHblits) or
 [ColabFold](https://github.com/sokrypton/ColabFold)'s, which uses the faster
 MMseqs2 instead. For the former, run:
 
@@ -108,9 +103,21 @@ Make sure to run the latter command on the machine that will be used for MSA
 generation (the script estimates how the precomputed database index used by
 MMseqs2 should be split according to the memory available on the system).
 
-Alternatively, you can use raw MSAs from our aforementioned MSA database or
+If you're using your own precomputed MSAs or MSAs from the RODA repository,
+there's no need to download these alignment databases. Simply make sure that
+the `alignment_dir` contains one directory per chain and that each of these
+contains alignments (.sto, .a3m, and .hhr) corresponding to that chain. You
+can use `scripts/flatten_roda.sh` to reformat RODA downloads in this way.
+Note that the RODA alignments are NOT compatible with the recent .cif ground
+truth files downloaded by `scripts/download_alphafold_dbs.sh`. To fetch .cif
+files that match the RODA MSAs, once the alignments are flattened, use
+`scripts/download_roda_pdbs.sh`. That script outputs a list of alignment dirs
+for which matching .cif files could not be found. These should be removed from
+the alignment directory.
+
+Alternatively, you can use raw MSAs from
 [ProteinNet](https://github.com/aqlaboratory/proteinnet). After downloading
-the latter database, use `scripts/prep_proteinnet_msas.py` to convert the data
+that database, use `scripts/prep_proteinnet_msas.py` to convert the data
 into a format recognized by the OpenFold parser. The resulting directory
 becomes the `alignment_dir` used in subsequent steps. Use
 `scripts/unpack_proteinnet.py` to extract `.core` files from ProteinNet text
@@ -324,11 +331,6 @@ multi-node distributed training, validation, and so on. For more information,
 consult PyTorch Lightning documentation and the `--help` flag of the training
 script.
 
-If you're using your own MSAs or MSAs from the RODA repository, make sure that
-the `alignment_dir` contains one directory per chain and that each of these
-contains alignments (.sto, .a3m, and .hhr) corresponding to that chain. You
-can use `scripts/flatten_roda.sh` to reformat RODA downloads in this way.
-
 Note that, despite its variable name, `mmcif_dir` can also contain PDB files
 or even ProteinNet .core files. To emulate the AlphaFold training procedure,
 which uses a self-distillation set subject to special preprocessing steps, use
......
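Reviewer note: the README text in this commit requires that `alignment_dir` hold one directory per chain, each containing .sto, .a3m, or .hhr alignments. The following is a minimal sketch of how that layout could be sanity-checked; it is not part of the commit. The function name `list_empty_chain_dirs` is hypothetical, and reading "contains alignments" as "at least one file with one of those extensions directly inside the chain directory" is an assumption.

```shell
#!/bin/bash
# Hypothetical helper (not part of this commit): print every chain directory
# under an alignment_dir that holds no .sto, .a3m, or .hhr file.
list_empty_chain_dirs() {
    local alignment_dir=$1
    local chain_dir
    for chain_dir in "$alignment_dir"/*/; do
        # Skip the unexpanded glob when alignment_dir has no subdirectories.
        [[ -d $chain_dir ]] || continue
        # Assumption: alignments sit directly inside the chain directory.
        if [[ -z $(find "$chain_dir" -maxdepth 1 -type f \
                \( -name '*.sto' -o -name '*.a3m' -o -name '*.hhr' \)) ]]; then
            echo "${chain_dir%/}"
        fi
    done
}
```

After sourcing the function, calling `list_empty_chain_dirs` with the path to the alignment directory prints one line per chain directory that lacks alignments, and prints nothing when every directory is populated.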
 #!/bin/bash
 #
-# Copyright 2021 DeepMind Technologies Limited
+# Copyright 2021 AlQuraishi Laboratories
 #
 # Licensed under the Apache License, Version 2.0 (the "License");
 # you may not use this file except in compliance with the License.
@@ -14,40 +14,39 @@
 # See the License for the specific language governing permissions and
 # limitations under the License.
 #
-# Downloads and unzips all required data for AlphaFold.
-#
-# Usage: bash download_all_data.sh /path/to/download/directory
-set -e
-
-if [[ $# -eq 0 ]]; then
-    echo "Error: download directory must be provided as an input argument."
+# Downloads .cif files matching the RODA alignments. Outputs a list of
+# RODA alignments for which .cif files could not be found..
+if [[ $# != 2 ]]; then
+    echo "usage: ./download_roda_pdbs.sh <out_dir> <roda_pdb_alignment_dir>"
     exit 1
 fi
 
-if ! command -v aria2c &> /dev/null ; then
-    echo "Error: aria2c could not be found. Please install aria2c (sudo apt install aria2)."
-    exit 1
-fi
+OUT_DIR=$1
+RODA_ALIGNMENT_DIR=$2
 
-DOWNLOAD_DIR="$1"
-DOWNLOAD_MODE="${2:-full_dbs}" # Default mode to full_dbs.
-if [[ "${DOWNLOAD_MODE}" != full_dbs && "${DOWNLOAD_MODE}" != reduced_dbs ]]
-then
-  echo "DOWNLOAD_MODE ${DOWNLOAD_MODE} not recognized."
-  exit 1
+if [[ -d $OUT_DIR ]]; then
+    echo "${OUT_DIR} already exists. Download failed..."
+    exit 1
 fi
 
-SCRIPT_DIR="$(dirname "$(realpath "$0")")"
+SERVER=snapshotrsync.rcsb.org # RCSB server name
+PORT=873 # port RCSB server is using
 
-echo "Downloading PDB70..."
-bash "${SCRIPT_DIR}/download_pdb70.sh" "${DOWNLOAD_DIR}"
+rsync -rlpt -v -z --delete --port=$PORT $SERVER::20220103/pub/pdb/data/structures/divided/mmCIF/ $OUT_DIR 2>&1 > /dev/null
 
-echo "Downloading PDB mmCIF files..."
-bash "${SCRIPT_DIR}/download_pdb_mmcif.sh" "${DOWNLOAD_DIR}"
+for f in $(find $OUT_DIR -mindepth 2 -type f); do
+    mv $f $OUT_DIR
+    BASENAME=$(basename $f)
+    gunzip "${OUT_DIR}/${BASENAME}"
+done
 
-if [[ -d openfold/resources/params ]]; then
-    ln -s openfold/resources/params "${DOWNLOAD_DIR}/params"
-    ln -s openfold/resources/openfold_params "${DOWNLOAD_DIR}/openfold_params"
-fi
+find $OUT_DIR -mindepth 1 -type d,l -delete
 
-echo "All data downloaded."
+for d in $(find $RODA_ALIGNMENT_DIR -mindepth 1 -maxdepth 1 -type d); do
+    BASENAME=$(basename $d)
+    PDB_ID=$(echo $BASENAME | cut -d '_' -f 1)
+    CIF_PATH="${OUT_DIR}/${PDB_ID}.cif"
+    if [[ ! -f $CIF_PATH ]]; then
+        echo $d
+    fi
+done
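Reviewer note: the README says the alignment directories listed by `download_roda_pdbs.sh` "should be removed from the alignment directory", but the commit does not include a step that does so. The following is a minimal sketch of one way to act on a saved copy of that list. The function name `remove_listed_dirs` and the list file are hypothetical, not part of the repository. Because the script redirects rsync with `2>&1 > /dev/null`, rsync's stderr ends up on the script's stdout, so a saved list may contain lines that are not paths; the sketch skips anything that is not an existing directory.

```shell
#!/bin/bash
# Hypothetical follow-up step (not part of this commit): delete the alignment
# directories named in a list file, one path per line, such as the saved
# stdout of download_roda_pdbs.sh.
remove_listed_dirs() {
    local list_file=$1
    local dir
    while IFS= read -r dir; do
        # Ignore blank lines and anything that is not an existing directory
        # (for example, stray rsync messages mixed into the list).
        [[ -n $dir && -d $dir ]] || continue
        rm -r -- "$dir"
        echo "removed ${dir}"
    done < "$list_file"
}
```

Since the removal is destructive, it may be worth inspecting the list file by hand before passing it to the function.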