ModelZoo / alphafold2_jax
Commit 9b18d6a9, authored Dec 11, 2022 by Augustin Zidek
Release code for v2.3.0
PiperOrigin-RevId: 494507694
Parent: 4494af84
Changes: 30 (showing 10 changed files with 227 additions and 68 deletions)
docs/technical_note_v2.3.0.md (+58, -0)
notebooks/AlphaFold.ipynb (+134, -53)
run_alphafold.py (+14, -4)
run_alphafold_test.py (+8, -2)
scripts/download_all_data.sh (+2, -2)
scripts/download_alphafold_params.sh (+1, -1)
scripts/download_mgnify.sh (+2, -2)
scripts/download_pdb_seqres.sh (+4, -0)
scripts/download_uniref30.sh (+3, -3)
setup.py (+1, -1)

docs/technical_note_v2.3.0.md (new file, mode 0 → 100644) @ 9b18d6a9
# AlphaFold v2.3.0

This technical note describes the updates to the code and model weights that
were made to produce AlphaFold v2.3.0, including updated training data.

We have fine-tuned new AlphaFold-Multimer weights using an identical model
architecture but a new training cutoff of 2021-09-30. Previously released
versions of AlphaFold and AlphaFold-Multimer were trained using PDB structures
with a release date before 2018-04-30, a cutoff date chosen to coincide with the
start of the 2018 CASP13 assessment. The new training cutoff provides ~30%
more data to train AlphaFold and, more importantly, includes much more data on
large protein complexes. The data before the new cutoff includes 4× the number
of electron microscopy structures and, in aggregate, twice the number of large
structures (more than 2,000 residues)[^1]. Due to the significant increase in
the number of large structures, we are also able to increase the size of
training crops (subsets of the structure used to train AlphaFold) from 384 to
640 residues. These new AlphaFold-Multimer models are expected to be
substantially more accurate on large protein complexes even though we use the
same model architecture and training methodology as in our previously released
AlphaFold-Multimer paper.

These models were initially developed in response to a request from the CASP
organizers to better understand baselines for the progress of structure
prediction in CASP15. Because of the significant increase in accuracy for
large targets, we are making them available as the default multimer models.
Since they were developed as baselines, we have emphasized minimal changes to
our previous AlphaFold-Multimer system while accommodating larger complexes.
In particular, we increase the number of chains used at training time from 8 to
20 and increase the maximum number of MSA sequences from 1,152 to 2,048 for 3 of
the 5 AlphaFold-Multimer models.

For the CASP15 baseline, we also used somewhat more expensive inference settings
that external groups have found to improve AlphaFold accuracy. We increase the
number of seeds per model to 20[^2] and increase the maximum number of
recyclings to 20 with early stopping[^3]. Increasing the number of seeds to 20
is recommended for very large or difficult targets but is not the default
because of the increased computational time.
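
The idea of recycling with early stopping can be illustrated with a small, self-contained sketch. Everything here is an assumption made for illustration: `refine_step`, `max_recycles`, and `tolerance` are hypothetical names, the toy update rule stands in for a full model pass, and the stopping criterion (mean change in pairwise distances between consecutive passes) is one plausible choice rather than a statement of what AlphaFold actually does.

```python
import numpy as np


def pairwise_distances(coords):
    """Return the (N, N) matrix of Euclidean distances between N points."""
    diff = coords[:, None, :] - coords[None, :, :]
    return np.sqrt((diff ** 2).sum(axis=-1))


def recycle_with_early_stop(coords, refine_step, max_recycles=20, tolerance=0.5):
    """Apply refine_step up to max_recycles times, stopping once it converges.

    refine_step is a hypothetical stand-in for one model pass: it maps an
    (N, 3) coordinate array to a refined (N, 3) array. Returns the final
    coordinates and the number of passes that were actually run.
    """
    prev_dists = pairwise_distances(coords)
    for n_run in range(1, max_recycles + 1):
        coords = refine_step(coords)
        dists = pairwise_distances(coords)
        # Assumed criterion: mean absolute change in pairwise distances.
        change = np.abs(dists - prev_dists).mean()
        if change < tolerance:
            return coords, n_run
        prev_dists = dists
    return coords, max_recycles


def toy_step(coords):
    """Toy update: move every point halfway towards the origin."""
    return coords * 0.5


start = np.array([[8.0, 0.0, 0.0], [0.0, 8.0, 0.0], [0.0, 0.0, 8.0]])
final_coords, passes = recycle_with_early_stop(start, toy_step)
print(passes)
```

With a cap of 20 passes, easy inputs stop after only a few iterations, while hard inputs can use the full budget. That is why the cap can be raised without paying the full cost on every target.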

Overall, we expect these new models to be the preferred models whenever the
stoichiometry of the complex is known, including known monomeric structures. In
cases where the stoichiometry is unknown, such as in genome-scale prediction, it
is likely that single chain AlphaFold will be more accurate on average unless
the chain has several thousand residues.

The predicted structures used for the CASP15 baselines are available
[here](https://github.com/deepmind/alphafold/blob/main/docs/casp15_predictions.zip).

[^1]: wwPDB Consortium. "Protein Data Bank: the single global archive for 3D
    macromolecular structure data." Nucleic Acids Res. 47, D520–D528 (2018).
[^2]: Johansson-Åkhe, Isak, and Björn Wallner. "Improving peptide-protein
    docking with AlphaFold-Multimer using forced sampling." Frontiers in
    bioinformatics 2 (2022): 959160-959160.
[^3]: Gao, Mu, et al. "AF2Complex predicts direct physical interactions in
    multimeric proteins with deep learning." Nature communications 13.1 (2022):
    1-13.

notebooks/AlphaFold.ipynb @ 9b18d6a9
This diff is collapsed.

run_alphafold.py @ 9b18d6a9
@@ -73,7 +73,7 @@ flags.DEFINE_string('bfd_database_path', None, 'Path to the BFD '
                     'database for use by HHblits.')
 flags.DEFINE_string('small_bfd_database_path', None, 'Path to the small '
                     'version of BFD used with the "reduced_dbs" preset.')
-flags.DEFINE_string('uniclust30_database_path', None, 'Path to the Uniclust30 '
+flags.DEFINE_string('uniref30_database_path', None, 'Path to the UniRef30 '
                     'database for use by HHblits.')
 flags.DEFINE_string('uniprot_database_path', None, 'Path to the Uniprot '
                     'database for use by JackHMMer.')
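
Because the flag is renamed rather than aliased in this hunk, callers that still pass `--uniclust30_database_path` will need updating. A minimal sketch of a wrapper-side helper is shown below; `migrate_flags`, `RENAMED_FLAGS`, and the example path are hypothetical and not part of the repository.

```python
# Hypothetical helper: rewrite pre-v2.3.0 flag names in an argument list.
RENAMED_FLAGS = {
    '--uniclust30_database_path': '--uniref30_database_path',
}


def migrate_flags(argv):
    """Return a copy of argv with renamed flags replaced.

    Handles both the '--flag=value' and the '--flag value' spellings.
    Arguments that are not in RENAMED_FLAGS are passed through unchanged.
    """
    migrated = []
    for arg in argv:
        name, sep, value = arg.partition('=')
        migrated.append(RENAMED_FLAGS.get(name, name) + sep + value)
    return migrated


old_args = [
    '--uniclust30_database_path=/data/uniclust30/uniclust30_2018_08',  # example path
    '--db_preset=full_dbs',
]
new_args = migrate_flags(old_args)
print(new_args)
```

This only renames the flag. The path value still has to be changed by hand to point at the newly downloaded UniRef30 database.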
@@ -181,6 +181,7 @@ def predict_structure(
   unrelaxed_pdbs = {}
   relaxed_pdbs = {}
+  relax_metrics = {}
   ranking_confidences = {}

   # Run the models.
@@ -239,7 +240,12 @@ def predict_structure(
     if amber_relaxer:
       # Relax the prediction.
       t_0 = time.time()
-      relaxed_pdb_str, _, _ = amber_relaxer.process(prot=unrelaxed_protein)
+      relaxed_pdb_str, _, violations = amber_relaxer.process(
+          prot=unrelaxed_protein)
+      relax_metrics[model_name] = {
+          'remaining_violations': violations,
+          'remaining_violations_count': sum(violations)
+      }
       timings[f'relax_{model_name}'] = time.time() - t_0

       relaxed_pdbs[model_name] = relaxed_pdb_str
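
The new `relax_metrics` entry can be reproduced in isolation. The sketch below assumes `violations` is a per-residue list of 0/1 floats, as in the test mock in this commit (`[1., 0., 0.]`); `summarize_violations` is a hypothetical name used only for illustration.

```python
def summarize_violations(violations):
    """Build the per-model metrics dict that the diff stores in relax_metrics."""
    return {
        'remaining_violations': violations,
        'remaining_violations_count': sum(violations),
    }


relax_metrics = {}
relax_metrics['model1'] = summarize_violations([1., 0., 0.])
print(relax_metrics)
```

Under that assumption, the count is just the number of residues still flagged after relaxation.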
@@ -273,6 +279,10 @@ def predict_structure(
   timings_output_path = os.path.join(output_dir, 'timings.json')
   with open(timings_output_path, 'w') as f:
     f.write(json.dumps(timings, indent=4))
+  if amber_relaxer:
+    relax_metrics_path = os.path.join(output_dir, 'relax_metrics.json')
+    with open(relax_metrics_path, 'w') as f:
+      f.write(json.dumps(relax_metrics, indent=4))


 def main(argv):
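
Downstream tooling can consume the new `relax_metrics.json` file directly. The following sketch writes a file in the same layout and reads it back; `load_relax_metrics`, `models_with_violations`, and the `model2` entry are hypothetical and exist only to make the example self-contained.

```python
import json
import os
import tempfile


def load_relax_metrics(output_dir):
    """Load relax_metrics.json from a prediction output directory."""
    with open(os.path.join(output_dir, 'relax_metrics.json')) as f:
        return json.load(f)


def models_with_violations(relax_metrics):
    """Return names of models that still have violations after relaxation."""
    return sorted(
        name for name, metrics in relax_metrics.items()
        if metrics['remaining_violations_count'] > 0)


# Self-contained demo: write a file in the same format, then read it back.
with tempfile.TemporaryDirectory() as output_dir:
    example = {
        'model1': {'remaining_violations': [1.0, 0.0, 0.0],
                   'remaining_violations_count': 1.0},
        'model2': {'remaining_violations': [0.0, 0.0, 0.0],
                   'remaining_violations_count': 0.0},
    }
    with open(os.path.join(output_dir, 'relax_metrics.json'), 'w') as f:
        f.write(json.dumps(example, indent=4))
    loaded = load_relax_metrics(output_dir)

flagged = models_with_violations(loaded)
print(flagged)
```

Note that `json.dumps` only accepts plain Python containers and numbers, so this layout relies on the violations being a list rather than an array type.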
@@ -290,7 +300,7 @@ def main(argv):
               should_be_set=use_small_bfd)
   _check_flag('bfd_database_path', 'db_preset',
               should_be_set=not use_small_bfd)
-  _check_flag('uniclust30_database_path', 'db_preset',
+  _check_flag('uniref30_database_path', 'db_preset',
               should_be_set=not use_small_bfd)

   run_multimer_system = 'multimer' in FLAGS.model_preset
@@ -341,7 +351,7 @@ def main(argv):
       uniref90_database_path=FLAGS.uniref90_database_path,
       mgnify_database_path=FLAGS.mgnify_database_path,
       bfd_database_path=FLAGS.bfd_database_path,
-      uniclust30_database_path=FLAGS.uniclust30_database_path,
+      uniref30_database_path=FLAGS.uniref30_database_path,
       small_bfd_database_path=FLAGS.small_bfd_database_path,
       template_searcher=template_searcher,
       template_featurizer=template_featurizer,

run_alphafold_test.py @ 9b18d6a9
@@ -14,6 +14,7 @@
 """Tests for run_alphafold."""

+import json
 import os

 from absl.testing import absltest
@@ -57,7 +58,7 @@ class RunAlphafoldTest(parameterized.TestCase):
         'max_predicted_aligned_error': np.array(0.),
     }
     model_runner_mock.multimer_mode = False
-    amber_relaxer_mock.process.return_value = ('RELAXED', None, None)
+    amber_relaxer_mock.process.return_value = ('RELAXED', None, [1., 0., 0.])

     out_dir = self.create_tempdir().full_path
     fasta_path = os.path.join(out_dir, 'target.fasta')
@@ -85,7 +86,12 @@ class RunAlphafoldTest(parameterized.TestCase):
         'result_model1.pkl', 'timings.json', 'unrelaxed_model1.pdb',
     ]
     if do_relax:
-      expected_files.append('relaxed_model1.pdb')
+      expected_files.extend(['relaxed_model1.pdb', 'relax_metrics.json'])
+      with open(os.path.join(out_dir, 'test', 'relax_metrics.json')) as f:
+        relax_metrics = json.loads(f.read())
+      self.assertDictEqual({'model1': {'remaining_violations': [1.0, 0.0, 0.0],
+                                       'remaining_violations_count': 1.0}},
+                           relax_metrics)
     self.assertCountEqual(expected_files, target_output_files)

     # Check that pLDDT is set in the B-factor column.

scripts/download_all_data.sh @ 9b18d6a9
@@ -59,8 +59,8 @@ bash "${SCRIPT_DIR}/download_pdb70.sh" "${DOWNLOAD_DIR}"
 echo "Downloading PDB mmCIF files..."
 bash "${SCRIPT_DIR}/download_pdb_mmcif.sh" "${DOWNLOAD_DIR}"

-echo "Downloading Uniclust30..."
-bash "${SCRIPT_DIR}/download_uniclust30.sh" "${DOWNLOAD_DIR}"
+echo "Downloading Uniref30..."
+bash "${SCRIPT_DIR}/download_uniref30.sh" "${DOWNLOAD_DIR}"

 echo "Downloading Uniref90..."
 bash "${SCRIPT_DIR}/download_uniref90.sh" "${DOWNLOAD_DIR}"

scripts/download_alphafold_params.sh @ 9b18d6a9
@@ -31,7 +31,7 @@ fi

 DOWNLOAD_DIR="$1"
 ROOT_DIR="${DOWNLOAD_DIR}/params"
-SOURCE_URL="https://storage.googleapis.com/alphafold/alphafold_params_2022-03-02.tar"
+SOURCE_URL="https://storage.googleapis.com/alphafold/alphafold_params_2022-12-06.tar"
 BASENAME=$(basename "${SOURCE_URL}")

 mkdir --parents "${ROOT_DIR}"

scripts/download_mgnify.sh @ 9b18d6a9
@@ -32,8 +32,8 @@ fi
 DOWNLOAD_DIR="$1"
 ROOT_DIR="${DOWNLOAD_DIR}/mgnify"
 # Mirror of:
-# ftp://ftp.ebi.ac.uk/pub/databases/metagenomics/peptide_database/2018_12/mgy_clusters.fa.gz
-SOURCE_URL="https://storage.googleapis.com/alphafold-databases/casp14_versions/mgy_clusters_2018_12.fa.gz"
+# ftp://ftp.ebi.ac.uk/pub/databases/metagenomics/peptide_database/2022_05/mgy_clusters.fa.gz
+SOURCE_URL="https://storage.googleapis.com/alphafold-databases/v2.3/mgy_clusters_2022_05.fa.gz"
 BASENAME=$(basename "${SOURCE_URL}")

 mkdir --parents "${ROOT_DIR}"

scripts/download_pdb_seqres.sh @ 9b18d6a9
@@ -36,3 +36,7 @@ BASENAME=$(basename "${SOURCE_URL}")

 mkdir --parents "${ROOT_DIR}"
 aria2c "${SOURCE_URL}" --dir="${ROOT_DIR}"
+
+# Keep only protein sequences.
+grep --after-context=1 --no-group-separator '>.* mol:protein' "${ROOT_DIR}/pdb_seqres.txt" > "${ROOT_DIR}/pdb_seqres_filtered.txt"
+mv "${ROOT_DIR}/pdb_seqres_filtered.txt" "${ROOT_DIR}/pdb_seqres.txt"
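
For readers less familiar with the `grep` options used here, a rough Python equivalent is sketched below. It assumes, as `--after-context=1` implies, that every record in `pdb_seqres.txt` is a header line followed by exactly one sequence line. `keep_protein_records` and the sample records are made up for illustration.

```python
import re

# Same pattern as the grep call: a header line that mentions mol:protein.
PROTEIN_HEADER = re.compile(r'>.* mol:protein')


def keep_protein_records(lines):
    """Yield each protein header line together with the line that follows it."""
    lines = iter(lines)
    for line in lines:
        if PROTEIN_HEADER.search(line):
            yield line
            # Equivalent of --after-context=1: also keep the next line.
            following = next(lines, None)
            if following is not None:
                yield following


sample = [
    '>101m_A mol:protein length:5  EXAMPLE PROTEIN',
    'MVLSE',
    '>1abc_B mol:na length:4  EXAMPLE DNA',
    'ACGT',
    '>1xyz_A mol:protein length:3  ANOTHER EXAMPLE',
    'GSH',
]
filtered = list(keep_protein_records(sample))
print('\n'.join(filtered))
```

The nucleic acid record is dropped and both protein records survive with their sequences. In the shell version, `--no-group-separator` stops `grep` from emitting `--` lines between non-adjacent matches, so the filtered file contains only header and sequence lines.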

scripts/download_uniclust30.sh → scripts/download_uniref30.sh @ 9b18d6a9
@@ -30,10 +30,10 @@ if ! command -v aria2c &> /dev/null ; then
 fi

 DOWNLOAD_DIR="$1"
-ROOT_DIR="${DOWNLOAD_DIR}/uniclust30"
+ROOT_DIR="${DOWNLOAD_DIR}/uniref30"
 # Mirror of:
-# http://wwwuser.gwdg.de/~compbiol/uniclust/2018_08/uniclust30_2018_08_hhsuite.tar.gz
-SOURCE_URL="https://storage.googleapis.com/alphafold-databases/casp14_versions/uniclust30_2018_08_hhsuite.tar.gz"
+# https://wwwuser.gwdg.de/~compbiol/uniclust/2021_03/UniRef30_2021_03.tar.gz
+SOURCE_URL="https://storage.googleapis.com/alphafold-databases/v2.3/UniRef30_2021_03.tar.gz"
 BASENAME=$(basename "${SOURCE_URL}")

 mkdir --parents "${ROOT_DIR}"

setup.py @ 9b18d6a9
@@ -18,7 +18,7 @@ from setuptools import setup

 setup(
     name='alphafold',
-    version='2.2.4',
+    version='2.3.0',
     description='An implementation of the inference pipeline of AlphaFold v2.0.'
     'This is a completely new model that was entered as AlphaFold2 in CASP14 '
     'and published in Nature.',