OpenDAS / OpenFold

Commit 2c7627f3, authored Aug 28, 2022 by Gustaf Ahdritz

Improve RODA download process

parent ceef010a

Showing 2 changed files with 69 additions and 15 deletions

README.md  +17 -15
scripts/download_roda_pdbs.sh  +52 -0
README.md

@@ -82,14 +82,9 @@ To install the HH-suite to `/usr/bin`, run
 ## Usage
-To download the databases used to train OpenFold and AlphaFold run:
-```bash
-bash scripts/download_data.sh data/
-```
-You have two choices for downloading protein databases, depending on whether
-you want to use DeepMind's MSA generation pipeline (w/ HMMR & HHblits) or
+If you intend to generate your own alignments, e.g. for inference, you have two
+choices for downloading protein databases, depending on whether you want to use
+DeepMind's MSA generation pipeline (w/ HMMR & HHblits) or
 [ColabFold](https://github.com/sokrypton/ColabFold)'s, which uses the faster
 MMseqs2 instead. For the former, run:
...
@@ -108,9 +103,21 @@ Make sure to run the latter command on the machine that will be used for MSA
 generation (the script estimates how the precomputed database index used by
 MMseqs2 should be split according to the memory available on the system).
-Alternatively, you can use raw MSAs from our aforementioned MSA database or
+If you're using your own precomputed MSAs or MSAs from the RODA repository,
+there's no need to download these alignment databases. Simply make sure that
+the `alignment_dir` contains one directory per chain and that each of these
+contains alignments (.sto, .a3m, and .hhr) corresponding to that chain. You
+can use `scripts/flatten_roda.sh` to reformat RODA downloads in this way.
+Note that the RODA alignments are NOT compatible with the recent .cif ground
+truth files downloaded by `scripts/download_alphafold_dbs.sh`. To fetch .cif
+files that match the RODA MSAs, once the alignments are flattened, use
+`scripts/download_roda_pdbs.sh`. That script outputs a list of alignment dirs
+for which matching .cif files could not be found. These should be removed from
+the alignment directory.
+Alternatively, you can use raw MSAs from
 [ProteinNet](https://github.com/aqlaboratory/proteinnet). After downloading
-the latter database, use `scripts/prep_proteinnet_msas.py` to convert the data
+that database, use `scripts/prep_proteinnet_msas.py` to convert the data
 into a format recognized by the OpenFold parser. The resulting directory
 becomes the `alignment_dir` used in subsequent steps. Use
 `scripts/unpack_proteinnet.py` to extract `.core` files from ProteinNet text
...
@@ -324,11 +331,6 @@ multi-node distributed training, validation, and so on. For more information,
 consult PyTorch Lightning documentation and the `--help` flag of the training
 script.
-If you're using your own MSAs or MSAs from the RODA repository, make sure that
-the `alignment_dir` contains one directory per chain and that each of these
-contains alignments (.sto, .a3m, and .hhr) corresponding to that chain. You
-can use `scripts/flatten_roda.sh` to reformat RODA downloads in this way.
 Note that, despite its variable name, `mmcif_dir` can also contain PDB files
 or even ProteinNet .core files. To emulate the AlphaFold training procedure,
 which uses a self-distillation set subject to special preprocessing steps, use
...
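
The README text added in this commit says the directories listed by `scripts/download_roda_pdbs.sh` should be removed from the alignment directory, but the commit does not show how. What follows is a minimal sketch of one way to do it, assuming the script's standard output was saved to a file with one path per line. `prune_unmatched` and `missing.txt` are hypothetical names, not part of OpenFold. Because the script runs rsync with `2>&1 > /dev/null`, rsync's error messages are sent to standard output and can end up in that file, so the helper ignores any line that is not an existing directory directly inside the alignment directory.

```shell
#!/bin/bash
# Sketch only: prune_unmatched is a hypothetical helper, not an OpenFold script.
# Usage (assumed workflow):
#   bash scripts/download_roda_pdbs.sh cifs/ alignments/ > missing.txt
#   prune_unmatched alignments/ missing.txt
prune_unmatched() {
    local alignment_dir list_file d
    alignment_dir=$(realpath "$1")
    list_file=$2
    while IFS= read -r d; do
        # Skip blank lines and anything that is not an existing directory
        # (for example stray rsync messages captured in the list).
        [[ -n "$d" && -d "$d" ]] || continue
        # Only delete directories that sit directly inside alignment_dir.
        if [[ "$(dirname "$(realpath "$d")")" == "$alignment_dir" ]]; then
            rm -r -- "$d"
        fi
    done < "$list_file"
}
```

Reviewing `missing.txt` by hand before pruning is probably wise, since the deletion is not reversible.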
scripts/download_data.sh → scripts/download_roda_pdbs.sh
100644 → 100755

 #!/bin/bash
 #
-# Copyright 2021 DeepMind Technologies Limited
+# Copyright 2021 AlQuraishi Laboratories
 #
 # Licensed under the Apache License, Version 2.0 (the "License");
 # you may not use this file except in compliance with the License.
...
@@ -14,40 +14,39 @@
 # See the License for the specific language governing permissions and
 # limitations under the License.
 #
-# Downloads and unzips all required data for AlphaFold.
-#
-# Usage: bash download_all_data.sh /path/to/download/directory
-set -e
-if [[ $# -eq 0 ]]; then
-    echo "Error: download directory must be provided as an input argument."
+# Downloads .cif files matching the RODA alignments. Outputs a list of
+# RODA alignments for which .cif files could not be found..
+if [[ $# != 2 ]]; then
+    echo "usage: ./download_roda_pdbs.sh <out_dir> <roda_pdb_alignment_dir>"
     exit 1
 fi
-if ! command -v aria2c &> /dev/null ; then
-    echo "Error: aria2c could not be found. Please install aria2c (sudo apt install aria2)."
-    exit 1
-fi
-DOWNLOAD_DIR="$1"
-DOWNLOAD_MODE="${2:-full_dbs}"  # Default mode to full_dbs.
-if [[ "${DOWNLOAD_MODE}" != full_dbs && "${DOWNLOAD_MODE}" != reduced_dbs ]]
-then
-  echo "DOWNLOAD_MODE ${DOWNLOAD_MODE} not recognized."
-  exit 1
+OUT_DIR=$1
+RODA_ALIGNMENT_DIR=$2
+if [[ -d $OUT_DIR ]]; then
+    echo "${OUT_DIR} already exists. Download failed..."
+    exit 1
 fi
-SCRIPT_DIR="$(dirname "$(realpath "$0")")"
-echo "Downloading PDB70..."
-bash "${SCRIPT_DIR}/download_pdb70.sh" "${DOWNLOAD_DIR}"
-echo "Downloading PDB mmCIF files..."
-bash "${SCRIPT_DIR}/download_pdb_mmcif.sh" "${DOWNLOAD_DIR}"
-if [[ -d openfold/resources/params ]]; then
-    ln -s openfold/resources/params "${DOWNLOAD_DIR}/params"
-    ln -s openfold/resources/openfold_params "${DOWNLOAD_DIR}/openfold_params"
-fi
-echo "All data downloaded."
+SERVER=snapshotrsync.rcsb.org  # RCSB server name
+PORT=873  # port RCSB server is using
+rsync -rlpt -v -z --delete --port=$PORT $SERVER::20220103/pub/pdb/data/structures/divided/mmCIF/ $OUT_DIR 2>&1 > /dev/null
+for f in $(find $OUT_DIR -mindepth 2 -type f); do
+    mv $f $OUT_DIR
+    BASENAME=$(basename $f)
+    gunzip "${OUT_DIR}/${BASENAME}"
+done
+find $OUT_DIR -mindepth 1 -type d,l -delete
+for d in $(find $RODA_ALIGNMENT_DIR -mindepth 1 -maxdepth 1 -type d); do
+    BASENAME=$(basename $d)
+    PDB_ID=$(echo $BASENAME | cut -d '_' -f 1)
+    CIF_PATH="${OUT_DIR}/${PDB_ID}.cif"
+    if [[ ! -f $CIF_PATH ]]; then
+        echo $d
+    fi
+done
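
The final loop of the new script derives a PDB ID from each chain directory name (the text before the first underscore) and prints the directory when `<out_dir>/<pdb_id>.cif` is missing. As committed, it word-splits the output of `find`, so paths containing whitespace would be mishandled. The following is a sketch of the same check using a glob and parameter expansion instead. `list_unmatched_chains` is a hypothetical name, and unlike `find -type d` the glob skips hidden directories and follows symlinks to directories.

```shell
#!/bin/bash
# Sketch only: list_unmatched_chains is a hypothetical name, not an OpenFold script.
# Prints every chain directory in alignment_dir whose PDB ID has no
# matching <pdb_id>.cif file in out_dir.
list_unmatched_chains() {
    local out_dir=$1 alignment_dir=$2 d base pdb_id
    for d in "$alignment_dir"/*/; do
        [[ -d "$d" ]] || continue   # the glob matched nothing
        d=${d%/}                    # drop the trailing slash added by the glob
        base=${d##*/}               # directory name, e.g. 1abc_A
        pdb_id=${base%%_*}          # text before the first underscore
        if [[ ! -f "$out_dir/$pdb_id.cif" ]]; then
            printf '%s\n' "$d"
        fi
    done
}
```

Called as `list_unmatched_chains cifs/ alignments/`, it is meant to print the same kind of list as the loop in the commit, one path per line.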