Release code for v2.3.0

PiperOrigin-RevId: 494507694

Commit 9b18d6a9 authored Dec 11, 2022 by Augustin Zidek
10 changed files
--- a/docs/technical_note_v2.3.0.md
+++ b/docs/technical_note_v2.3.0.md
+# AlphaFold v2.3.0
+
+This technical note describes updates in the code and model weights that were
+made to produce AlphaFold v2.3.0, including updated training data.
+
+We have fine-tuned new AlphaFold-Multimer weights using identical model
+architecture but a new training cutoff of 2021-09-30. Previously released
+versions of AlphaFold and AlphaFold-Multimer were trained using PDB structures
+with a release date before 2018-04-30, a cutoff date chosen to coincide with the
+start of the 2018 CASP13 assessment. The new training cutoff represents ~30%
+more data to train AlphaFold and, more importantly, includes much more data on
+large protein complexes. The new training cutoff includes 4× the number of
+electron microscopy structures and in aggregate twice the number of large
+structures (more than 2,000 residues)[^1]. Due to the significant increase in
+the number of large structures, we are also able to increase the size of
+training crops (subsets of the structure used to train AlphaFold) from 384 to
+640 residues. These new AlphaFold-Multimer models are expected to be
+substantially more accurate on large protein complexes even though we use the
+same model architecture and training methodology as our previously released
+AlphaFold-Multimer paper.
+
+These models were initially developed in response to a request from the CASP
+organizers to better understand baselines for the progress of structure
+prediction in CASP15, and because of the significant increase in accuracy for
+large targets, we are making them available as the default multimer models.
+Since they were developed as baselines, we have emphasized minimal changes to
+our previous AlphaFold-Multimer system while accommodating larger complexes.
+In particular, we increase the number of chains used at training time from 8 to
+20 and increase the maximum number of MSA sequences from 1,152 to 2,048 for 3 of
+the 5 AlphaFold-Multimer models.
+
+For the CASP15 baseline, we also used somewhat more expensive inference settings
+that have been found externally to improve AlphaFold accuracy. We increase the
+number of seeds per model to 20[^2] and increase the maximum number of
+recyclings to 20 with early stopping[^3]. Increasing the number of seeds to 20
+is recommended for very large or difficult targets but is not the default due to
+increased computational time.
+
+Overall, we expect these new models to be the preferred models whenever the
+stoichiometry of the complex is known, including known monomeric structures. In
+cases where the stoichiometry is unknown, such as in genome-scale prediction, it
+is likely that single-chain AlphaFold will be more accurate on average unless
+the chain has several thousand residues.
+
+The predicted structures used for the CASP15 baselines are available
+[here](https://github.com/deepmind/alphafold/blob/main/docs/casp15_predictions.zip).
+
+
+[^1]: wwPDB Consortium. "Protein Data Bank: the single global archive for 3D
+  macromolecular structure data." Nucleic Acids Res. 47, D520–D528 (2018).
+
+[^2]: Johansson-Åkhe, Isak, and Björn Wallner. "Improving peptide-protein
+  docking with AlphaFold-Multimer using forced sampling." Frontiers in
+  bioinformatics 2 (2022): 959160-959160.
+
+[^3]: Gao, Mu, et al. "AF2Complex predicts direct physical interactions in
+  multimeric proteins with deep learning." Nature communications 13.1 (2022):
+  1-13.
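The technical note mentions raising the maximum number of recyclings to 20 "with early stopping" but does not state the stopping rule. The sketch below illustrates one plausible reading, under the assumption that recycling stops once the mean change in pairwise distances between successive iterations falls below a tolerance. The names `recycle_with_early_stop`, `toy_step`, and the `tolerance` value are hypothetical and are not taken from the AlphaFold code base.

```python
import math
from typing import Callable, List, Tuple

Coords = List[Tuple[float, ...]]


def pairwise_distances(coords: Coords) -> List[float]:
    """Flattened upper triangle of the inter-point distance matrix."""
    return [math.dist(a, b) for i, a in enumerate(coords) for b in coords[i + 1:]]


def recycle_with_early_stop(
    step: Callable[[Coords], Coords],
    coords: Coords,
    max_recycles: int = 20,   # upper bound quoted in the technical note
    tolerance: float = 0.5,   # ASSUMPTION: illustrative threshold only
) -> Tuple[Coords, int]:
    """Applies `step` repeatedly; stops early once distances stop changing."""
    prev = pairwise_distances(coords)
    for n in range(1, max_recycles + 1):
        coords = step(coords)
        cur = pairwise_distances(coords)
        # Mean absolute change in pairwise distances since the last iteration.
        change = sum(abs(c - p) for c, p in zip(cur, prev)) / len(cur)
        if change < tolerance:
            return coords, n
        prev = cur
    return coords, max_recycles


# Toy stand-in for one model pass: move every point halfway to a fixed target.
TARGET: Coords = [(0.0, 0.0, 0.0), (10.0, 0.0, 0.0)]


def toy_step(coords: Coords) -> Coords:
    return [tuple((c + t) / 2 for c, t in zip(p, q)) for p, q in zip(coords, TARGET)]


final_coords, recycles_used = recycle_with_early_stop(
    toy_step, [(0.0, 0.0, 0.0), (2.0, 0.0, 0.0)])
print(recycles_used, final_coords)
```

With the 5 multimer models and 20 seeds per model described in the note, a loop of this kind would run up to 100 times per target, which is why the 20-seed setting is not the default.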
--- a/notebooks/AlphaFold.ipynb
+++ b/notebooks/AlphaFold.ipynb
--- a/run_alphafold.py
+++ b/run_alphafold.py
@@ -73,7 +73,7 @@ flags.DEFINE_string('bfd_database_path', None, 'Path to the BFD '
                    'database for use by HHblits.')
 flags.DEFINE_string('small_bfd_database_path', None, 'Path to the small '
                    'version of BFD used with the "reduced_dbs" preset.')
-flags.DEFINE_string('uniclust30_database_path', None, 'Path to the Uniclust30 '
+flags.DEFINE_string('uniref30_database_path', None, 'Path to the UniRef30 '
                    'database for use by HHblits.')
 flags.DEFINE_string('uniprot_database_path', None, 'Path to the Uniprot '
                    'database for use by JackHMMer.')
@@ -181,6 +181,7 @@ def predict_structure(

  unrelaxed_pdbs = {}
  relaxed_pdbs = {}
+  relax_metrics = {}
  ranking_confidences = {}

  # Run the models.
@@ -239,7 +240,12 @@ def predict_structure(
    if amber_relaxer:
      # Relax the prediction.
      t_0 = time.time()
-      relaxed_pdb_str, _, _ = amber_relaxer.process(prot=unrelaxed_protein)
+      relaxed_pdb_str, _, violations = amber_relaxer.process(
+          prot=unrelaxed_protein)
+      relax_metrics[model_name] = {
+          'remaining_violations': violations,
+          'remaining_violations_count': sum(violations)
+      }
      timings[f'relax_{model_name}'] = time.time() - t_0

      relaxed_pdbs[model_name] = relaxed_pdb_str
@@ -273,6 +279,10 @@ def predict_structure(
  timings_output_path = os.path.join(output_dir, 'timings.json')
  with open(timings_output_path, 'w') as f:
    f.write(json.dumps(timings, indent=4))
+  if amber_relaxer:
+    relax_metrics_path = os.path.join(output_dir, 'relax_metrics.json')
+    with open(relax_metrics_path, 'w') as f:
+      f.write(json.dumps(relax_metrics, indent=4))


 def main(argv):
@@ -290,7 +300,7 @@ def main(argv):
              should_be_set=use_small_bfd)
  _check_flag('bfd_database_path', 'db_preset',
              should_be_set=not use_small_bfd)
-  _check_flag('uniclust30_database_path', 'db_preset',
+  _check_flag('uniref30_database_path', 'db_preset',
              should_be_set=not use_small_bfd)

  run_multimer_system = 'multimer' in FLAGS.model_preset
@@ -341,7 +351,7 @@ def main(argv):
      uniref90_database_path=FLAGS.uniref90_database_path,
      mgnify_database_path=FLAGS.mgnify_database_path,
      bfd_database_path=FLAGS.bfd_database_path,
-      uniclust30_database_path=FLAGS.uniclust30_database_path,
+      uniref30_database_path=FLAGS.uniref30_database_path,
      small_bfd_database_path=FLAGS.small_bfd_database_path,
      template_searcher=template_searcher,
      template_featurizer=template_featurizer,
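The hunks above make `predict_structure` write a `relax_metrics.json` file next to `timings.json` whenever relaxation is enabled. A minimal sketch of how a downstream script might read that file and flag models whose relaxed structure still has violations is shown below. The helper `models_with_violations` and the model names in the sample data are hypothetical; only the file name and the two keys come from the diff.

```python
import json
import os
import tempfile
from typing import Dict


def models_with_violations(relax_metrics_path: str) -> Dict[str, float]:
    """Returns {model_name: count} for models with remaining violations."""
    with open(relax_metrics_path) as f:
        relax_metrics = json.load(f)
    return {
        name: metrics['remaining_violations_count']
        for name, metrics in relax_metrics.items()
        if metrics['remaining_violations_count'] > 0
    }


# Write a small sample file in the same shape as the one produced above.
# The values are illustrative, not real AlphaFold output.
with tempfile.TemporaryDirectory() as output_dir:
    path = os.path.join(output_dir, 'relax_metrics.json')
    sample = {
        'model_1': {'remaining_violations': [1.0, 0.0, 0.0],
                    'remaining_violations_count': 1.0},
        'model_2': {'remaining_violations': [0.0, 0.0, 0.0],
                    'remaining_violations_count': 0.0},
    }
    with open(path, 'w') as f:
        f.write(json.dumps(sample, indent=4))
    flagged = models_with_violations(path)

print(flagged)
```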

--- a/run_alphafold_test.py
+++ b/run_alphafold_test.py
@@ -14,6 +14,7 @@

 """Tests for run_alphafold."""

+import json
 import os

 from absl.testing import absltest
@@ -57,7 +58,7 @@ class RunAlphafoldTest(parameterized.TestCase):
        'max_predicted_aligned_error': np.array(0.),
    }
    model_runner_mock.multimer_mode = False
-    amber_relaxer_mock.process.return_value = ('RELAXED', None, None)
+    amber_relaxer_mock.process.return_value = ('RELAXED', None, [1., 0., 0.])

    out_dir = self.create_tempdir().full_path
    fasta_path = os.path.join(out_dir, 'target.fasta')
@@ -85,7 +86,12 @@ class RunAlphafoldTest(parameterized.TestCase):
        'result_model1.pkl', 'timings.json', 'unrelaxed_model1.pdb',
    ]
    if do_relax:
-      expected_files.append('relaxed_model1.pdb')
+      expected_files.extend(['relaxed_model1.pdb', 'relax_metrics.json'])
+      with open(os.path.join(out_dir, 'test', 'relax_metrics.json')) as f:
+        relax_metrics = json.loads(f.read())
+      self.assertDictEqual({'model1': {'remaining_violations': [1.0, 0.0, 0.0],
+                                       'remaining_violations_count': 1.0}},
+                           relax_metrics)
    self.assertCountEqual(expected_files, target_output_files)

    # Check that pLDDT is set in the B-factor column.

--- a/scripts/download_all_data.sh
+++ b/scripts/download_all_data.sh
@@ -59,8 +59,8 @@ bash "${SCRIPT_DIR}/download_pdb70.sh" "${DOWNLOAD_DIR}"
 echo "Downloading PDB mmCIF files..."
 bash "${SCRIPT_DIR}/download_pdb_mmcif.sh" "${DOWNLOAD_DIR}"

-echo "Downloading Uniclust30..."
-bash "${SCRIPT_DIR}/download_uniclust30.sh" "${DOWNLOAD_DIR}"
+echo "Downloading Uniref30..."
+bash "${SCRIPT_DIR}/download_uniref30.sh" "${DOWNLOAD_DIR}"

 echo "Downloading Uniref90..."
 bash "${SCRIPT_DIR}/download_uniref90.sh" "${DOWNLOAD_DIR}"

--- a/scripts/download_alphafold_params.sh
+++ b/scripts/download_alphafold_params.sh
@@ -31,7 +31,7 @@ fi

 DOWNLOAD_DIR="$1"
 ROOT_DIR="${DOWNLOAD_DIR}/params"
-SOURCE_URL="https://storage.googleapis.com/alphafold/alphafold_params_2022-03-02.tar"
+SOURCE_URL="https://storage.googleapis.com/alphafold/alphafold_params_2022-12-06.tar"
 BASENAME=$(basename "${SOURCE_URL}")

 mkdir --parents "${ROOT_DIR}"

--- a/scripts/download_mgnify.sh
+++ b/scripts/download_mgnify.sh
@@ -32,8 +32,8 @@ fi
 DOWNLOAD_DIR="$1"
 ROOT_DIR="${DOWNLOAD_DIR}/mgnify"
 # Mirror of:
-# ftp://ftp.ebi.ac.uk/pub/databases/metagenomics/peptide_database/2018_12/mgy_clusters.fa.gz
-SOURCE_URL="https://storage.googleapis.com/alphafold-databases/casp14_versions/mgy_clusters_2018_12.fa.gz"
+# ftp://ftp.ebi.ac.uk/pub/databases/metagenomics/peptide_database/2022_05/mgy_clusters.fa.gz
+SOURCE_URL="https://storage.googleapis.com/alphafold-databases/v2.3/mgy_clusters_2022_05.fa.gz"
 BASENAME=$(basename "${SOURCE_URL}")

 mkdir --parents "${ROOT_DIR}"

--- a/scripts/download_pdb_seqres.sh
+++ b/scripts/download_pdb_seqres.sh
@@ -36,3 +36,7 @@ BASENAME=$(basename "${SOURCE_URL}")

 mkdir --parents "${ROOT_DIR}"
 aria2c "${SOURCE_URL}" --dir="${ROOT_DIR}"
+
+# Keep only protein sequences.
+grep --after-context=1 --no-group-separator '>.* mol:protein' "${ROOT_DIR}/pdb_seqres.txt" > "${ROOT_DIR}/pdb_seqres_filtered.txt"
+mv "${ROOT_DIR}/pdb_seqres_filtered.txt" "${ROOT_DIR}/pdb_seqres.txt"
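The `grep` call added above keeps each header line matching `>.* mol:protein` plus the one line that follows it (`--after-context=1`), and `--no-group-separator` suppresses the `--` lines that `grep` would otherwise print between groups. The same filter can be sketched in Python as follows. The function name `keep_protein_records` and the sample records are made up for illustration; the snippet assumes each record is a header line followed by a single sequence line.

```python
import re
from typing import Iterable, Iterator

# Same pattern as the grep call; search() matches anywhere in the line.
PROTEIN_HEADER = re.compile(r'>.* mol:protein')


def keep_protein_records(lines: Iterable[str]) -> Iterator[str]:
    """Yields protein headers and the single line after each, like grep -A 1."""
    trailing = 0  # Lines of context still owed after the last match.
    for line in lines:
        if PROTEIN_HEADER.search(line):
            trailing = 1
            yield line
        elif trailing:
            trailing -= 1
            yield line


# Illustrative records only, not real PDB entries.
sample = [
    '>1abc_A mol:protein length:5  EXAMPLE PROTEIN',
    'MKTAY',
    '>1abc_B mol:na length:4  EXAMPLE DNA',
    'ACGT',
    '>2xyz_A mol:protein length:4  ANOTHER EXAMPLE',
    'GSHM',
]
filtered = list(keep_protein_records(sample))
print('\n'.join(filtered))
```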
--- a/scripts/download_uniclust30.sh
+++ b/scripts/download_uniref30.sh
@@ -30,10 +30,10 @@ if ! command -v aria2c &> /dev/null ; then
 fi

 DOWNLOAD_DIR="$1"
-ROOT_DIR="${DOWNLOAD_DIR}/uniclust30"
+ROOT_DIR="${DOWNLOAD_DIR}/uniref30"
 # Mirror of:
-# http://wwwuser.gwdg.de/~compbiol/uniclust/2018_08/uniclust30_2018_08_hhsuite.tar.gz
-SOURCE_URL="https://storage.googleapis.com/alphafold-databases/casp14_versions/uniclust30_2018_08_hhsuite.tar.gz"
+# https://wwwuser.gwdg.de/~compbiol/uniclust/2021_03/UniRef30_2021_03.tar.gz
+SOURCE_URL="https://storage.googleapis.com/alphafold-databases/v2.3/UniRef30_2021_03.tar.gz"
 BASENAME=$(basename "${SOURCE_URL}")

 mkdir --parents "${ROOT_DIR}"

--- a/setup.py
+++ b/setup.py
@@ -18,7 +18,7 @@ from setuptools import setup

 setup(
    name='alphafold',
-    version='2.2.4',
+    version='2.3.0',
    description='An implementation of the inference pipeline of AlphaFold v2.0.'
    'This is a completely new model that was entered as AlphaFold2 in CASP14 '
    'and published in Nature.',