Commit 9b18d6a9 authored by Augustin Zidek

Release code for v2.3.0

PiperOrigin-RevId: 494507694
parent 4494af84
# AlphaFold v2.3.0

This technical note describes the updates to the code and model weights that
were made to produce AlphaFold v2.3.0, including updated training data.

We have fine-tuned new AlphaFold-Multimer weights using an identical model
architecture but a new training cutoff of 2021-09-30. Previously released
versions of AlphaFold and AlphaFold-Multimer were trained using PDB structures
with a release date before 2018-04-30, a cutoff date chosen to coincide with the
start of the 2018 CASP13 assessment. The new training cutoff provides ~30% more
data for training AlphaFold and, more importantly, much more data on large
protein complexes: it includes 4× the number of electron microscopy structures
and, in aggregate, twice the number of large structures (more than 2,000
residues)[^1]. Due to the significant increase in
the number of large structures, we are also able to increase the size of
training crops (subsets of the structure used to train AlphaFold) from 384 to
640 residues. These new AlphaFold-Multimer models are expected to be
substantially more accurate on large protein complexes even though we use the
same model architecture and training methodology as in our previously published
AlphaFold-Multimer paper.

These models were initially developed in response to a request from the CASP
organizers to better understand baselines for the progress of structure
prediction in CASP15, and because of the significant increase in accuracy for
large targets, we are making them available as the default multimer models.

Since they were developed as baselines, we have emphasized minimal changes to
our previous AlphaFold-Multimer system while accommodating larger complexes.
In particular, we increase the number of chains used at training time from 8 to
20 and increase the maximum number of MSA sequences from 1,152 to 2,048 for 3 of
the 5 AlphaFold-Multimer models.

For the CASP15 baseline, we also used somewhat more expensive inference settings
that have been found externally to improve AlphaFold accuracy. We increase the
number of seeds per model to 20[^2] and increase the maximum number of
recyclings to 20 with early stopping[^3]. Increasing the number of seeds to 20
is recommended for very large or difficult targets but is not the default due to
increased computational time.
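To illustrate the idea of recycling with early stopping mentioned above, here is a minimal sketch. It is not AlphaFold's implementation: the function `recycle_with_early_stop`, the `step` callback, the default `tolerance` of 0.5, and the choice of mean pairwise-distance change as the stopping criterion are all assumptions made for this example.

```python
import math
from typing import Callable, List, Sequence, Tuple

Point = Tuple[float, float, float]
Coords = List[Point]


def pairwise_distances(coords: Sequence[Point]) -> List[float]:
    """Returns the flattened upper-triangle list of pairwise distances."""
    return [
        math.dist(coords[i], coords[j])
        for i in range(len(coords))
        for j in range(i + 1, len(coords))
    ]


def recycle_with_early_stop(
    step: Callable[[Coords], Coords],
    initial: Coords,
    max_recycles: int = 20,   # Upper bound on recycling iterations.
    tolerance: float = 0.5,   # Hypothetical stopping threshold.
) -> Tuple[Coords, int]:
    """Applies `step` up to `max_recycles` times, stopping once coordinates settle.

    Returns the final coordinates and the number of iterations actually run.
    """
    coords = initial
    prev = pairwise_distances(coords)
    for iteration in range(1, max_recycles + 1):
        coords = step(coords)
        current = pairwise_distances(coords)
        # Mean absolute change in pairwise distances since the last iteration.
        change = sum(abs(a - b) for a, b in zip(current, prev)) / max(len(current), 1)
        if change < tolerance:
            return coords, iteration
        prev = current
    return coords, max_recycles
```

With a toy `step` that halves every coordinate, two points starting 8 units apart converge under the 0.5 tolerance well before the 20-iteration cap, which is the compute saving that early stopping is meant to provide.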
Overall, we expect these new models to be the preferred models whenever the
stoichiometry of the complex is known, including known monomeric structures. In
cases where the stoichiometry is unknown, such as in genome-scale prediction, it
is likely that single-chain AlphaFold will be more accurate on average unless
the chain has several thousand residues.

The predicted structures used for the CASP15 baselines are available
[here](https://github.com/deepmind/alphafold/blob/main/docs/casp15_predictions.zip).

[^1]: wwPDB Consortium. "Protein Data Bank: the single global archive for 3D
macromolecular structure data." Nucleic Acids Res. 47, D520–D528 (2018).
[^2]: Johansson-Åkhe, Isak, and Björn Wallner. "Improving peptide-protein
docking with AlphaFold-Multimer using forced sampling." Frontiers in
Bioinformatics 2 (2022): 959160.
[^3]: Gao, Mu, et al. "AF2Complex predicts direct physical interactions in
multimeric proteins with deep learning." Nature Communications 13.1 (2022):
1-13.
@@ -73,7 +73,7 @@ flags.DEFINE_string('bfd_database_path', None, 'Path to the BFD '
                     'database for use by HHblits.')
 flags.DEFINE_string('small_bfd_database_path', None, 'Path to the small '
                     'version of BFD used with the "reduced_dbs" preset.')
-flags.DEFINE_string('uniclust30_database_path', None, 'Path to the Uniclust30 '
+flags.DEFINE_string('uniref30_database_path', None, 'Path to the UniRef30 '
                     'database for use by HHblits.')
 flags.DEFINE_string('uniprot_database_path', None, 'Path to the Uniprot '
                     'database for use by JackHMMer.')
@@ -181,6 +181,7 @@ def predict_structure(
   unrelaxed_pdbs = {}
   relaxed_pdbs = {}
+  relax_metrics = {}
   ranking_confidences = {}
   # Run the models.
@@ -239,7 +240,12 @@ def predict_structure(
     if amber_relaxer:
       # Relax the prediction.
       t_0 = time.time()
-      relaxed_pdb_str, _, _ = amber_relaxer.process(prot=unrelaxed_protein)
+      relaxed_pdb_str, _, violations = amber_relaxer.process(
+          prot=unrelaxed_protein)
+      relax_metrics[model_name] = {
+          'remaining_violations': violations,
+          'remaining_violations_count': sum(violations)
+      }
       timings[f'relax_{model_name}'] = time.time() - t_0
       relaxed_pdbs[model_name] = relaxed_pdb_str
@@ -273,6 +279,10 @@ def predict_structure(
   timings_output_path = os.path.join(output_dir, 'timings.json')
   with open(timings_output_path, 'w') as f:
     f.write(json.dumps(timings, indent=4))
+  if amber_relaxer:
+    relax_metrics_path = os.path.join(output_dir, 'relax_metrics.json')
+    with open(relax_metrics_path, 'w') as f:
+      f.write(json.dumps(relax_metrics, indent=4))
 def main(argv):
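The hunks above write a `relax_metrics.json` file that maps each model name to its `remaining_violations` and `remaining_violations_count`. As a hedged sketch of how a downstream script might read it, the helper below collects the per-model counts. The function name `summarize_relax_metrics` is hypothetical and not part of this commit; only the file name and the two keys are taken from the diff.

```python
import json
import os
from typing import Dict


def summarize_relax_metrics(output_dir: str) -> Dict[str, float]:
    """Maps each model name to its remaining violation count after relaxation."""
    # File name and keys follow the diff above; everything else is illustrative.
    path = os.path.join(output_dir, 'relax_metrics.json')
    with open(path) as f:
        relax_metrics = json.load(f)
    return {
        model_name: metrics['remaining_violations_count']
        for model_name, metrics in relax_metrics.items()
    }
```

A nonzero count for a model suggests that relaxation did not resolve every violation for that prediction, which may be worth checking before using the relaxed structure.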
@@ -290,7 +300,7 @@ def main(argv):
               should_be_set=use_small_bfd)
   _check_flag('bfd_database_path', 'db_preset',
               should_be_set=not use_small_bfd)
-  _check_flag('uniclust30_database_path', 'db_preset',
+  _check_flag('uniref30_database_path', 'db_preset',
               should_be_set=not use_small_bfd)
   run_multimer_system = 'multimer' in FLAGS.model_preset
@@ -341,7 +351,7 @@ def main(argv):
       uniref90_database_path=FLAGS.uniref90_database_path,
       mgnify_database_path=FLAGS.mgnify_database_path,
       bfd_database_path=FLAGS.bfd_database_path,
-      uniclust30_database_path=FLAGS.uniclust30_database_path,
+      uniref30_database_path=FLAGS.uniref30_database_path,
       small_bfd_database_path=FLAGS.small_bfd_database_path,
       template_searcher=template_searcher,
       template_featurizer=template_featurizer,
......
@@ -14,6 +14,7 @@
 """Tests for run_alphafold."""
+import json
 import os
 from absl.testing import absltest
@@ -57,7 +58,7 @@ class RunAlphafoldTest(parameterized.TestCase):
         'max_predicted_aligned_error': np.array(0.),
     }
     model_runner_mock.multimer_mode = False
-    amber_relaxer_mock.process.return_value = ('RELAXED', None, None)
+    amber_relaxer_mock.process.return_value = ('RELAXED', None, [1., 0., 0.])
     out_dir = self.create_tempdir().full_path
     fasta_path = os.path.join(out_dir, 'target.fasta')
@@ -85,7 +86,12 @@ class RunAlphafoldTest(parameterized.TestCase):
         'result_model1.pkl', 'timings.json', 'unrelaxed_model1.pdb',
     ]
     if do_relax:
-      expected_files.append('relaxed_model1.pdb')
+      expected_files.extend(['relaxed_model1.pdb', 'relax_metrics.json'])
+      with open(os.path.join(out_dir, 'test', 'relax_metrics.json')) as f:
+        relax_metrics = json.loads(f.read())
+      self.assertDictEqual({'model1': {'remaining_violations': [1.0, 0.0, 0.0],
+                                       'remaining_violations_count': 1.0}},
+                           relax_metrics)
     self.assertCountEqual(expected_files, target_output_files)
     # Check that pLDDT is set in the B-factor column.
......
@@ -59,8 +59,8 @@ bash "${SCRIPT_DIR}/download_pdb70.sh" "${DOWNLOAD_DIR}"
 echo "Downloading PDB mmCIF files..."
 bash "${SCRIPT_DIR}/download_pdb_mmcif.sh" "${DOWNLOAD_DIR}"
-echo "Downloading Uniclust30..."
-bash "${SCRIPT_DIR}/download_uniclust30.sh" "${DOWNLOAD_DIR}"
+echo "Downloading Uniref30..."
+bash "${SCRIPT_DIR}/download_uniref30.sh" "${DOWNLOAD_DIR}"
 echo "Downloading Uniref90..."
 bash "${SCRIPT_DIR}/download_uniref90.sh" "${DOWNLOAD_DIR}"
......
@@ -31,7 +31,7 @@ fi
 DOWNLOAD_DIR="$1"
 ROOT_DIR="${DOWNLOAD_DIR}/params"
-SOURCE_URL="https://storage.googleapis.com/alphafold/alphafold_params_2022-03-02.tar"
+SOURCE_URL="https://storage.googleapis.com/alphafold/alphafold_params_2022-12-06.tar"
 BASENAME=$(basename "${SOURCE_URL}")
 mkdir --parents "${ROOT_DIR}"
......
@@ -32,8 +32,8 @@ fi
 DOWNLOAD_DIR="$1"
 ROOT_DIR="${DOWNLOAD_DIR}/mgnify"
 # Mirror of:
-# ftp://ftp.ebi.ac.uk/pub/databases/metagenomics/peptide_database/2018_12/mgy_clusters.fa.gz
-SOURCE_URL="https://storage.googleapis.com/alphafold-databases/casp14_versions/mgy_clusters_2018_12.fa.gz"
+# ftp://ftp.ebi.ac.uk/pub/databases/metagenomics/peptide_database/2022_05/mgy_clusters.fa.gz
+SOURCE_URL="https://storage.googleapis.com/alphafold-databases/v2.3/mgy_clusters_2022_05.fa.gz"
 BASENAME=$(basename "${SOURCE_URL}")
 mkdir --parents "${ROOT_DIR}"
......
@@ -36,3 +36,7 @@ BASENAME=$(basename "${SOURCE_URL}")
 mkdir --parents "${ROOT_DIR}"
 aria2c "${SOURCE_URL}" --dir="${ROOT_DIR}"
+# Keep only protein sequences.
+grep --after-context=1 --no-group-separator '>.* mol:protein' "${ROOT_DIR}/pdb_seqres.txt" > "${ROOT_DIR}/pdb_seqres_filtered.txt"
+mv "${ROOT_DIR}/pdb_seqres_filtered.txt" "${ROOT_DIR}/pdb_seqres.txt"
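The `grep` call above keeps every header line that matches `>.* mol:protein` together with the single line that follows it (the sequence), and drops all other records. For readers who want the same filter outside the shell, here is an approximate Python sketch. The function name `filter_protein_records` and the sample records in the usage note are illustrative assumptions, not part of this commit.

```python
import re
from typing import Iterable, Iterator

# Same pattern as the grep command in the diff.
_PROTEIN_HEADER = re.compile(r'>.* mol:protein')


def filter_protein_records(lines: Iterable[str]) -> Iterator[str]:
    """Yields each matching header line and the one line that follows it."""
    keep_next = False
    for line in lines:
        if _PROTEIN_HEADER.search(line):
            # A matching header: emit it and remember to emit the next line.
            yield line
            keep_next = True
        elif keep_next:
            # The line right after a matching header (the sequence).
            yield line
            keep_next = False
```

For example, given a made-up protein record followed by a made-up `mol:na` record, only the protein header and its sequence line are yielded.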
@@ -30,10 +30,10 @@ if ! command -v aria2c &> /dev/null ; then
 fi
 DOWNLOAD_DIR="$1"
-ROOT_DIR="${DOWNLOAD_DIR}/uniclust30"
+ROOT_DIR="${DOWNLOAD_DIR}/uniref30"
 # Mirror of:
-# http://wwwuser.gwdg.de/~compbiol/uniclust/2018_08/uniclust30_2018_08_hhsuite.tar.gz
-SOURCE_URL="https://storage.googleapis.com/alphafold-databases/casp14_versions/uniclust30_2018_08_hhsuite.tar.gz"
+# https://wwwuser.gwdg.de/~compbiol/uniclust/2021_03/UniRef30_2021_03.tar.gz
+SOURCE_URL="https://storage.googleapis.com/alphafold-databases/v2.3/UniRef30_2021_03.tar.gz"
 BASENAME=$(basename "${SOURCE_URL}")
 mkdir --parents "${ROOT_DIR}"
......
@@ -18,7 +18,7 @@ from setuptools import setup
 setup(
     name='alphafold',
-    version='2.2.4',
+    version='2.3.0',
     description='An implementation of the inference pipeline of AlphaFold v2.0.'
     'This is a completely new model that was entered as AlphaFold2 in CASP14 '
     'and published in Nature.',
......