Improve RODA download process

Commit 2c7627f3 authored Aug 28, 2022 by Gustaf Ahdritz

Showing with 69 additions and 15 deletions

README.md +17 -15

scripts/download_roda_pdbs.sh +52 -0

--- a/README.md
+++ b/README.md
@@ -82,14 +82,9 @@ To install the HH-suite to `/usr/bin`, run
 ## Usage
-To download the databases used to train OpenFold and AlphaFold run:
+If you intend to generate your own alignments, e.g. for inference, you have two 
+choices for downloading protein databases, depending on whether you want to use 
-```bash
+DeepMind's MSA generation pipeline (w/ HMMER & HHblits) or 
-bash scripts/download_data.sh data/
-```
-You have two choices for downloading protein databases, depending on whether 
-you want to use DeepMind's MSA generation pipeline (w/ HMMR & HHblits) or 
 [ColabFold](https://github.com/sokrypton/ColabFold)'s, which uses the faster
 MMseqs2 instead. For the former, run:
@@ -108,9 +103,21 @@ Make sure to run the latter command on the machine that will be used for MSA
 generation (the script estimates how the precomputed database index used by
 MMseqs2 should be split according to the memory available on the system).
-Alternatively, you can use raw MSAs from our aforementioned MSA database or
+If you're using your own precomputed MSAs or MSAs from the RODA repository, 
+there's no need to download these alignment databases. Simply make sure that
+the `alignment_dir` contains one directory per chain and that each of these
+contains alignments (.sto, .a3m, and .hhr) corresponding to that chain. You
+can use `scripts/flatten_roda.sh` to reformat RODA downloads in this way.
+Note that the RODA alignments are NOT compatible with the recent .cif ground
+truth files downloaded by `scripts/download_alphafold_dbs.sh`. To fetch .cif 
+files that match the RODA MSAs, once the alignments are flattened, use 
+`scripts/download_roda_pdbs.sh`. That script outputs a list of alignment dirs 
+for which matching .cif files could not be found. These should be removed from 
+the alignment directory.
+Alternatively, you can use raw MSAs from 
 [ProteinNet](https://github.com/aqlaboratory/proteinnet). After downloading
-the latter database, use `scripts/prep_proteinnet_msas.py` to convert the data 
+that database, use `scripts/prep_proteinnet_msas.py` to convert the data 
 into a format recognized by the OpenFold parser. The resulting directory 
 becomes the `alignment_dir` used in subsequent steps. Use 
 `scripts/unpack_proteinnet.py` to extract `.core` files from ProteinNet text 
@@ -324,11 +331,6 @@ multi-node distributed training, validation, and so on. For more information,
 consult PyTorch Lightning documentation and the `--help` flag of the training 
 script.
-If you're using your own MSAs or MSAs from the RODA repository, make sure that
-the `alignment_dir` contains one directory per chain and that each of these
-contains alignments (.sto, .a3m, and .hhr) corresponding to that chain. You
-can use `scripts/flatten_roda.sh` to reformat RODA downloads in this way.
 Note that, despite its variable name, `mmcif_dir` can also contain PDB files 
 or even ProteinNet .core files. To emulate the AlphaFold training procedure, 
 which uses a self-distillation set subject to special preprocessing steps, use
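
The README text above requires one directory per chain in `alignment_dir`, each holding .sto, .a3m, or .hhr alignments. A minimal sketch of how that layout could be checked before training follows. The function name `list_empty_chain_dirs` is hypothetical and not part of the repository, and the check only looks at file extensions, not file contents.

```bash
# Hypothetical helper (not part of the repository): prints the name of every
# chain directory under an alignment_dir that holds no .sto, .a3m, or .hhr file.
list_empty_chain_dirs() {
    alignment_dir="$1"
    for d in "${alignment_dir}"/*/; do
        # Skip the literal pattern when the directory has no subdirectories
        [ -d "$d" ] || continue
        # Take the first alignment file found directly inside the chain dir
        hit=$(find "$d" -maxdepth 1 -type f \
            \( -name '*.sto' -o -name '*.a3m' -o -name '*.hhr' \) | head -n 1)
        if [ -z "$hit" ]; then
            basename "$d"
        fi
    done
}
```

Calling `list_empty_chain_dirs path/to/alignment_dir` should print nothing when every chain directory contains at least one alignment file.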

--- a/scripts/download_data.sh
+++ b/scripts/download_data.sh
 #!/bin/bash
 #
-# Copyright 2021 DeepMind Technologies Limited
+# Copyright 2021 AlQuraishi Laboratories
 #
 # Licensed under the Apache License, Version 2.0 (the "License");
 # you may not use this file except in compliance with the License.
@@ -14,40 +14,39 @@
 # See the License for the specific language governing permissions and
 # limitations under the License.
 #
-# Downloads and unzips all required data for AlphaFold.
+# Downloads .cif files matching the RODA alignments. Outputs a list of 
-#
+# RODA alignments for which .cif files could not be found.
-# Usage: bash download_all_data.sh /path/to/download/directory
+if [[ $# != 2 ]]; then
-set -e
+    echo "usage: ./download_roda_pdbs.sh <out_dir> <roda_pdb_alignment_dir>"
-if [[ $# -eq 0 ]]; then
-    echo "Error: download directory must be provided as an input argument."
    exit 1
 fi
-if ! command -v aria2c &> /dev/null ; then
+OUT_DIR=$1
-    echo "Error: aria2c could not be found. Please install aria2c (sudo apt install aria2)."
+RODA_ALIGNMENT_DIR=$2
-    exit 1
-fi
-DOWNLOAD_DIR="$1"
+if [[ -d $OUT_DIR ]]; then
-DOWNLOAD_MODE="${2:-full_dbs}" # Default mode to full_dbs.
+    echo "${OUT_DIR} already exists. Download failed..."
-if [[ "${DOWNLOAD_MODE}" != full_dbs && "${DOWNLOAD_MODE}" != reduced_dbs ]]
+    exit 1
-then
-  echo "DOWNLOAD_MODE ${DOWNLOAD_MODE} not recognized."
-  exit 1
 fi
-SCRIPT_DIR="$(dirname "$(realpath "$0")")"
+SERVER=snapshotrsync.rcsb.org                       # RCSB server name
+PORT=873                                           # port RCSB server is using
-echo "Downloading PDB70..."
+rsync -rlpt -v -z --delete --port=$PORT $SERVER::20220103/pub/pdb/data/structures/divided/mmCIF/ $OUT_DIR 2>&1 > /dev/null
-bash "${SCRIPT_DIR}/download_pdb70.sh" "${DOWNLOAD_DIR}"
-echo "Downloading PDB mmCIF files..."
+for f in $(find $OUT_DIR -mindepth 2 -type f); do
-bash "${SCRIPT_DIR}/download_pdb_mmcif.sh" "${DOWNLOAD_DIR}"
+    mv $f $OUT_DIR
+    BASENAME=$(basename $f)
+    gunzip "${OUT_DIR}/${BASENAME}"
+done
-if [[ -d openfold/resources/params ]]; then
+find $OUT_DIR -mindepth 1 -type d,l -delete
-    ln -s openfold/resources/params "${DOWNLOAD_DIR}/params"
-    ln -s openfold/resources/openfold_params "${DOWNLOAD_DIR}/openfold_params"
-fi
-echo "All data downloaded."
+for d in $(find $RODA_ALIGNMENT_DIR -mindepth 1 -maxdepth 1 -type d); do
+    BASENAME=$(basename $d)
+    PDB_ID=$(echo $BASENAME | cut -d '_' -f 1)
+    CIF_PATH="${OUT_DIR}/${PDB_ID}.cif"
+    if [[ ! -f $CIF_PATH ]]; then
+        echo $d
+    fi
+done
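
The final loop of the script reports alignment directories that have no matching .cif file. The sketch below restates that check as a standalone function with quoted paths, so that directory names containing spaces are handled. It is an illustration under assumptions, not the committed code: the name `list_missing_cifs` is hypothetical, and directory names are assumed to follow the `<pdb_id>_<chain>` pattern the script relies on.

```bash
# Hypothetical restatement of the script's last loop (not the committed code).
# Prints each alignment directory whose PDB ID has no .cif file in cif_dir.
list_missing_cifs() {
    cif_dir="$1"
    alignment_dir="$2"
    for d in "${alignment_dir}"/*/; do
        # Skip the literal pattern when there are no subdirectories
        [ -d "$d" ] || continue
        chain_name=$(basename "$d")
        # Everything before the first underscore is taken as the PDB ID,
        # mirroring `cut -d '_' -f 1` in the script
        pdb_id=${chain_name%%_*}
        if [ ! -f "${cif_dir}/${pdb_id}.cif" ]; then
            # Strip the trailing slash left by the glob
            printf '%s\n' "${d%/}"
        fi
    done
}
```

The printed paths correspond to the directories that the README says should be removed from the alignment directory. Reviewing the list before deleting anything is the safer choice.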