Commit 2c7627f3 authored by Gustaf Ahdritz

Improve RODA download process

parent ceef010a
......@@ -82,14 +82,9 @@ To install the HH-suite to `/usr/bin`, run
## Usage
To download the databases used to train OpenFold and AlphaFold run:
```bash
bash scripts/download_data.sh data/
```
You have two choices for downloading protein databases, depending on whether
you want to use DeepMind's MSA generation pipeline (w/ HMMER & HHblits) or
If you intend to generate your own alignments, e.g. for inference, you have two
choices for downloading protein databases, depending on whether you want to use
DeepMind's MSA generation pipeline (w/ HMMER & HHblits) or
[ColabFold](https://github.com/sokrypton/ColabFold)'s, which uses the faster
MMseqs2 instead. For the former, run:
......@@ -108,9 +103,21 @@ Make sure to run the latter command on the machine that will be used for MSA
generation (the script estimates how the precomputed database index used by
MMseqs2 should be split according to the memory available on the system).
Alternatively, you can use raw MSAs from our aforementioned MSA database or
If you're using your own precomputed MSAs or MSAs from the RODA repository,
there's no need to download these alignment databases. Simply make sure that
the `alignment_dir` contains one directory per chain and that each of these
contains alignments (.sto, .a3m, and .hhr) corresponding to that chain. You
can use `scripts/flatten_roda.sh` to reformat RODA downloads in this way.
Note that the RODA alignments are NOT compatible with the recent .cif ground
truth files downloaded by `scripts/download_alphafold_dbs.sh`. To fetch .cif
files that match the RODA MSAs, once the alignments are flattened, use
`scripts/download_roda_pdbs.sh`. That script outputs a list of alignment dirs
for which matching .cif files could not be found. These should be removed from
the alignment directory.
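The cleanup step described above could be sketched as follows. This is a minimal illustration, not part of the repository: the helper name `remove_listed_dirs` and the list file are hypothetical, and it assumes the script's output was saved to a file with one directory path per line. Lines that do not name an existing directory are skipped, so stray messages in the captured output are ignored.

```bash
# Hypothetical helper (not part of OpenFold): remove every alignment
# directory named in a list file, one path per line.
remove_listed_dirs() {
    local list_file="$1"
    local dir
    while IFS= read -r dir; do
        # Skip blank lines and anything that is not an existing directory.
        [[ -n "$dir" && -d "$dir" ]] || continue
        rm -r -- "$dir"
    done < "$list_file"
}
```

For example, assuming the list was captured with `bash scripts/download_roda_pdbs.sh cif_dir/ alignment_dir/ > missing.txt` (the file name is an assumption), `remove_listed_dirs missing.txt` would delete the unmatched chain directories. Reviewing the list before deleting anything is advisable.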
Alternatively, you can use raw MSAs from
[ProteinNet](https://github.com/aqlaboratory/proteinnet). After downloading
the latter database, use `scripts/prep_proteinnet_msas.py` to convert the data
that database, use `scripts/prep_proteinnet_msas.py` to convert the data
into a format recognized by the OpenFold parser. The resulting directory
becomes the `alignment_dir` used in subsequent steps. Use
`scripts/unpack_proteinnet.py` to extract `.core` files from ProteinNet text
......@@ -324,11 +331,6 @@ multi-node distributed training, validation, and so on. For more information,
consult PyTorch Lightning documentation and the `--help` flag of the training
script.
If you're using your own MSAs or MSAs from the RODA repository, make sure that
the `alignment_dir` contains one directory per chain and that each of these
contains alignments (.sto, .a3m, and .hhr) corresponding to that chain. You
can use `scripts/flatten_roda.sh` to reformat RODA downloads in this way.
Note that, despite its variable name, `mmcif_dir` can also contain PDB files
or even ProteinNet .core files. To emulate the AlphaFold training procedure,
which uses a self-distillation set subject to special preprocessing steps, use
......
#!/bin/bash
#
# Copyright 2021 DeepMind Technologies Limited
# Copyright 2021 AlQuraishi Laboratories
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
......@@ -14,40 +14,39 @@
# See the License for the specific language governing permissions and
# limitations under the License.
#
# Downloads and unzips all required data for AlphaFold.
#
# Usage: bash download_all_data.sh /path/to/download/directory
set -e
if [[ $# -eq 0 ]]; then
echo "Error: download directory must be provided as an input argument."
# Downloads .cif files matching the RODA alignments. Outputs a list of
# RODA alignments for which .cif files could not be found.
if [[ $# != 2 ]]; then
echo "usage: ./download_roda_pdbs.sh <out_dir> <roda_pdb_alignment_dir>"
exit 1
fi
if ! command -v aria2c &> /dev/null ; then
echo "Error: aria2c could not be found. Please install aria2c (sudo apt install aria2)."
exit 1
fi
OUT_DIR=$1
RODA_ALIGNMENT_DIR=$2
DOWNLOAD_DIR="$1"
DOWNLOAD_MODE="${2:-full_dbs}" # Default mode to full_dbs.
if [[ "${DOWNLOAD_MODE}" != full_dbs && "${DOWNLOAD_MODE}" != reduced_dbs ]]
then
echo "DOWNLOAD_MODE ${DOWNLOAD_MODE} not recognized."
if [[ -d $OUT_DIR ]]; then
echo "${OUT_DIR} already exists. Download failed..."
exit 1
fi
SCRIPT_DIR="$(dirname "$(realpath "$0")")"
SERVER=snapshotrsync.rcsb.org # RCSB server name
PORT=873 # port RCSB server is using
echo "Downloading PDB70..."
bash "${SCRIPT_DIR}/download_pdb70.sh" "${DOWNLOAD_DIR}"
rsync -rlpt -v -z --delete --port=$PORT $SERVER::20220103/pub/pdb/data/structures/divided/mmCIF/ $OUT_DIR 2>&1 > /dev/null
echo "Downloading PDB mmCIF files..."
bash "${SCRIPT_DIR}/download_pdb_mmcif.sh" "${DOWNLOAD_DIR}"
for f in $(find $OUT_DIR -mindepth 2 -type f); do
mv $f $OUT_DIR
BASENAME=$(basename $f)
gunzip "${OUT_DIR}/${BASENAME}"
done
if [[ -d openfold/resources/params ]]; then
ln -s openfold/resources/params "${DOWNLOAD_DIR}/params"
ln -s openfold/resources/openfold_params "${DOWNLOAD_DIR}/openfold_params"
fi
find $OUT_DIR -mindepth 1 -type d,l -delete
echo "All data downloaded."
for d in $(find $RODA_ALIGNMENT_DIR -mindepth 1 -maxdepth 1 -type d); do
BASENAME=$(basename $d)
PDB_ID=$(echo $BASENAME | cut -d '_' -f 1)
CIF_PATH="${OUT_DIR}/${PDB_ID}.cif"
if [[ ! -f $CIF_PATH ]]; then
echo $d
fi
done
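
The matching rule in the final loop (take the text before the first underscore of each chain directory name as the PDB ID, then look for `<PDB_ID>.cif`) can be restated as a standalone function. The sketch below is an illustrative alternative, not the committed code: the function name `list_unmatched_alignments` is hypothetical, and it uses a glob and parameter expansion instead of `find` and `cut`, so paths containing spaces are handled. Unlike `find`, the glob skips hidden directories and follows symlinks to directories.

```bash
# Hypothetical restatement of the final loop as a function.
# Prints each chain directory whose PDB ID has no <id>.cif file in cif_dir.
list_unmatched_alignments() {
    local cif_dir="$1" alignment_dir="$2"
    local d base pdb_id
    for d in "$alignment_dir"/*/; do
        [[ -d "$d" ]] || continue   # the glob matched nothing
        d="${d%/}"                  # drop the trailing slash
        base="${d##*/}"             # directory name, e.g. 1abc_A
        pdb_id="${base%%_*}"        # text before the first underscore
        [[ -f "$cif_dir/$pdb_id.cif" ]] || printf '%s\n' "$d"
    done
}
```

Called as `list_unmatched_alignments "$OUT_DIR" "$RODA_ALIGNMENT_DIR"`, it prints one directory per line, in the same form the script's loop does.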