Commit 9b18d6a9 authored by Augustin Zidek

Release code for v2.3.0

PiperOrigin-RevId: 494507694
parent 4494af84
# AlphaFold v2.3.0

This technical note describes updates in the code and model weights that were
made to produce AlphaFold v2.3.0, including updated training data.

We have fine-tuned new AlphaFold-Multimer weights using identical model
architecture but a new training cutoff of 2021-09-30. Previously released
versions of AlphaFold and AlphaFold-Multimer were trained using PDB structures
with a release date before 2018-04-30, a cutoff date chosen to coincide with the
start of the 2018 CASP13 assessment. The new training cutoff represents ~30%
more data to train AlphaFold and more importantly includes much more data on
large protein complexes. The new training cutoff includes 4× the number of
electron microscopy structures and in aggregate twice the number of large
structures (more than 2,000 residues)[^1]. Due to the significant increase in
the number of large structures, we are also able to increase the size of
training crops (subsets of the structure used to train AlphaFold) from 384 to
640 residues. These new AlphaFold-Multimer models are expected to be
substantially more accurate on large protein complexes even though we use the
same model architecture and training methodology as described in our previously
released AlphaFold-Multimer paper.

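The crop-size change can be pictured with a toy helper that samples a contiguous training crop. This is a sketch under stated assumptions: `sample_crop` is a hypothetical name, and AlphaFold's real training pipeline uses more involved (e.g. multi-chain, spatially aware) cropping — the shared idea is just training on a random subset of a large structure.

```python
import random

def sample_crop(num_res, crop_size=640, rng=None):
    """Return a half-open (start, end) window of at most crop_size residues.

    Illustrative only: the real pipeline's cropping is more sophisticated,
    but the principle is training on a random contiguous subset.
    """
    rng = rng or random.Random()
    if num_res <= crop_size:
        return 0, num_res  # structure fits entirely in one crop
    start = rng.randint(0, num_res - crop_size)
    return start, start + crop_size
```

With a 640-residue crop, a 2,000-residue complex yields a window covering about a third of the structure, versus under a fifth at the old 384-residue setting.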
These models were initially developed in response to a request from the CASP
organizers to better understand baselines for the progress of structure
prediction in CASP15, and because of the significant increase in accuracy for
large targets, we are making them available as the default multimer models.
Since they were developed as baselines, we have emphasized minimal changes to
our previous AlphaFold-Multimer system while accommodating larger complexes.
In particular, we increase the number of chains used at training time from 8 to
20 and increase the maximum number of MSA sequences from 1,152 to 2,048 for 3 of
the 5 AlphaFold-Multimer models.

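As a rough sketch of what raising the MSA cap means in practice, the helper below subsamples an alignment down to a maximum number of sequences while always keeping the query row. The function name and the uniform sampling are illustrative assumptions, not the pipeline's actual selection logic.

```python
import random

def subsample_msa(msa_rows, max_seqs=2048, rng=None):
    """Cap an MSA at max_seqs rows, always keeping the query (row 0).

    Illustrative only: uniform sampling is an assumption; the real
    pipeline uses its own selection scheme.
    """
    rng = rng or random.Random()
    if len(msa_rows) <= max_seqs:
        return list(msa_rows)
    rest = rng.sample(msa_rows[1:], max_seqs - 1)
    return [msa_rows[0]] + rest
```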
For the CASP15 baseline, we also used somewhat more expensive inference settings
that have been found externally to improve AlphaFold accuracy. We increase the
number of seeds per model to 20[^2] and increase the maximum number of
recyclings to 20 with early stopping[^3]. Increasing the number of seeds to 20
is recommended for very large or difficult targets but is not the default due to
increased computational time.

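The recycling-with-early-stopping idea can be sketched as follows. Everything here is a stand-in: `predict_fn` is a hypothetical model forward pass returning coordinates, and the mean-distance stopping rule mirrors the spirit, not the exact implementation, of the tolerance check.

```python
import math

def recycle_with_early_stopping(predict_fn, inputs, max_recycles=20, tol=0.5):
    """Repeatedly feed the model's output back in ("recycling"), stopping
    once coordinates move less than `tol` angstroms on average between
    iterations. Sketch only: `predict_fn` is a hypothetical forward pass
    returning a list of (x, y, z) tuples.
    """
    prev = None
    for step in range(1, max_recycles + 1):
        coords = predict_fn(inputs, prev)
        if prev is not None:
            delta = sum(math.dist(a, b) for a, b in zip(coords, prev)) / len(coords)
            if delta < tol:
                return coords, step  # converged: stop recycling early
        prev = coords
    return prev, max_recycles
```

Because easy targets converge in a few iterations, raising the cap to 20 recycles mostly costs extra compute only on the hard targets that benefit from it.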
Overall, we expect these new models to be the preferred models whenever the
stoichiometry of the complex is known, including known monomeric structures. In
cases where the stoichiometry is unknown, such as in genome-scale prediction, it
is likely that single chain AlphaFold will be more accurate on average unless
the chain has several thousand residues.

The predicted structures used for the CASP15 baselines are available
[here](https://github.com/deepmind/alphafold/blob/main/docs/casp15_predictions.zip).

[^1]: wwPDB Consortium. "Protein Data Bank: the single global archive for 3D
macromolecular structure data." Nucleic Acids Res. 47, D520–D528 (2018).

[^2]: Johansson-Åkhe, I. & Wallner, B. "Improving peptide-protein docking with
AlphaFold-Multimer using forced sampling." Front. Bioinform. 2, 959160 (2022).

[^3]: Gao, M. et al. "AF2Complex predicts direct physical interactions in
multimeric proteins with deep learning." Nat. Commun. 13, 1–13 (2022).
@@ -8,15 +8,15 @@
"source": [
"# AlphaFold Colab\n",
"\n",
"This Colab notebook allows you to easily predict the structure of a protein using a slightly simplified version of [AlphaFold v2.2.4](https://doi.org/10.1038/s41586-021-03819-2). \n",
"This Colab notebook allows you to easily predict the structure of a protein using a slightly simplified version of [AlphaFold v2.3.0](https://doi.org/10.1038/s41586-021-03819-2). \n",
"\n",
"**Differences to AlphaFold v2.2.4**\n",
"**Differences to AlphaFold v2.3.0**\n",
"\n",
"In comparison to AlphaFold v2.2.4, this Colab notebook uses **no templates (homologous structures)** and a selected portion of the [BFD database](https://bfd.mmseqs.com/). We have validated these changes on several thousand recent PDB structures. While accuracy will be near-identical to the full AlphaFold system on many targets, a small fraction have a large drop in accuracy due to the smaller MSA and lack of templates. For best reliability, we recommend instead using the [full open source AlphaFold](https://github.com/deepmind/alphafold/), or the [AlphaFold Protein Structure Database](https://alphafold.ebi.ac.uk/).\n",
"In comparison to AlphaFold v2.3.0, this Colab notebook uses **no templates (homologous structures)** and a selected portion of the [BFD database](https://bfd.mmseqs.com/). We have validated these changes on several thousand recent PDB structures. While accuracy will be near-identical to the full AlphaFold system on many targets, a small fraction have a large drop in accuracy due to the smaller MSA and lack of templates. For best reliability, we recommend instead using the [full open source AlphaFold](https://github.com/deepmind/alphafold/), or the [AlphaFold Protein Structure Database](https://alphafold.ebi.ac.uk/).\n",
"\n",
"**This Colab has a small drop in average accuracy for multimers compared to a local AlphaFold installation; for full multimer accuracy it is highly recommended to run [AlphaFold locally](https://github.com/deepmind/alphafold#running-alphafold).** Moreover, AlphaFold-Multimer requires an MSA search for every unique sequence in the complex and is therefore substantially slower. If your notebook times out due to the slow multimer MSA search, we recommend either using Colab Pro or running AlphaFold locally.\n",
"\n",
"Please note that this Colab notebook is provided as an early-access prototype and is not a finished product. It is provided for theoretical modelling only and caution should be exercised in its use. \n",
"Please note that this Colab notebook is provided for theoretical modelling only and caution should be exercised in its use. \n",
"\n",
"The **PAE file format** has been updated to match AFDB. Please see the [AFDB FAQ](https://alphafold.ebi.ac.uk/faq/#faq-7) for a description of the new format.\n",
"\n",
@@ -67,11 +67,11 @@
"source": [
"#@title 1. Install third-party software\n",
"\n",
"#@markdown Please execute this cell by pressing the _Play_ button \n",
"#@markdown on the left to download and import third-party software \n",
"#@markdown Please execute this cell by pressing the _Play_ button\n",
"#@markdown on the left to download and import third-party software\n",
"#@markdown in this Colab notebook. (See the [acknowledgements](https://github.com/deepmind/alphafold/#acknowledgements) in our readme.)\n",
"\n",
"#@markdown **Note**: This installs the software on the Colab \n",
"#@markdown **Note**: This installs the software on the Colab\n",
"#@markdown notebook in the cloud and not on your computer.\n",
"\n",
"from IPython.utils import io\n",
@@ -135,12 +135,11 @@
"source": [
"#@title 2. Download AlphaFold\n",
"\n",
"#@markdown Please execute this cell by pressing the *Play* button on \n",
"#@markdown Please execute this cell by pressing the *Play* button on\n",
"#@markdown the left.\n",
"\n",
"GIT_REPO = 'https://github.com/deepmind/alphafold'\n",
"\n",
"SOURCE_URL = 'https://storage.googleapis.com/alphafold/alphafold_params_colab_2022-03-02.tar'\n",
"SOURCE_URL = 'https://storage.googleapis.com/alphafold/alphafold_params_colab_2022-12-06.tar'\n",
"PARAMS_DIR = './alphafold/data/params'\n",
"PARAMS_PATH = os.path.join(PARAMS_DIR, os.path.basename(SOURCE_URL))\n",
"\n",
@@ -167,6 +166,7 @@
" %shell mkdir -p /opt/conda/lib/python3.8/site-packages/alphafold/common/\n",
" %shell cp -f /content/stereo_chemical_props.txt /opt/conda/lib/python3.8/site-packages/alphafold/common/\n",
"\n",
" # Load parameters\n",
" %shell mkdir --parents \"{PARAMS_DIR}\"\n",
" %shell wget -O \"{PARAMS_PATH}\" \"{SOURCE_URL}\"\n",
" pbar.update(27)\n",
@@ -222,10 +222,17 @@
"source": [
"#@title 3. Enter the amino acid sequence(s) to fold ⬇️\n",
"#@markdown Enter the amino acid sequence(s) to fold:\n",
"#@markdown * If you enter only a single sequence, the monomer model will be used.\n",
"#@markdown * If you enter only a single sequence, the monomer model will be \n",
"#@markdown used (unless you override this below).\n",
"#@markdown * If you enter multiple sequences, the multimer model will be used.\n",
"\n",
"from alphafold.notebooks import notebook_utils\n",
"import enum\n",
"\n",
"@enum.unique\n",
"class ModelType(enum.Enum):\n",
" MONOMER = 0\n",
" MULTIMER = 1\n",
"\n",
"sequence_1 = 'MAAHKGAEHHHKAAEHHEQAAKHHHAAAEHHEKGEHEQAAHHADTAYAHHKHAEEHAAQAAKHDAEHHAPKPH' #@param {type:\"string\"}\n",
"sequence_2 = '' #@param {type:\"string\"}\n",
@@ -235,20 +242,81 @@
"sequence_6 = '' #@param {type:\"string\"}\n",
"sequence_7 = '' #@param {type:\"string\"}\n",
"sequence_8 = '' #@param {type:\"string\"}\n",
"\n",
"input_sequences = (sequence_1, sequence_2, sequence_3, sequence_4,\n",
" sequence_5, sequence_6, sequence_7, sequence_8)\n",
"\n",
"MIN_SINGLE_SEQUENCE_LENGTH = 16\n",
"MAX_SINGLE_SEQUENCE_LENGTH = 2500\n",
"MAX_MULTIMER_LENGTH = 2500\n",
"\n",
"# Validate the input.\n",
"sequences, model_type_to_use = notebook_utils.validate_input(\n",
"sequence_9 = '' #@param {type:\"string\"}\n",
"sequence_10 = '' #@param {type:\"string\"}\n",
"sequence_11 = '' #@param {type:\"string\"}\n",
"sequence_12 = '' #@param {type:\"string\"}\n",
"sequence_13 = '' #@param {type:\"string\"}\n",
"sequence_14 = '' #@param {type:\"string\"}\n",
"sequence_15 = '' #@param {type:\"string\"}\n",
"sequence_16 = '' #@param {type:\"string\"}\n",
"sequence_17 = '' #@param {type:\"string\"}\n",
"sequence_18 = '' #@param {type:\"string\"}\n",
"sequence_19 = '' #@param {type:\"string\"}\n",
"sequence_20 = '' #@param {type:\"string\"}\n",
"\n",
"input_sequences = (\n",
" sequence_1, sequence_2, sequence_3, sequence_4, sequence_5, \n",
" sequence_6, sequence_7, sequence_8, sequence_9, sequence_10,\n",
" sequence_11, sequence_12, sequence_13, sequence_14, sequence_15, \n",
" sequence_16, sequence_17, sequence_18, sequence_19, sequence_20)\n",
"\n",
"MIN_PER_SEQUENCE_LENGTH = 16\n",
"MAX_PER_SEQUENCE_LENGTH = 3400\n",
"MAX_MONOMER_MODEL_LENGTH = 2500\n",
"MAX_LENGTH = 3400\n",
"MAX_VALIDATED_LENGTH = 3000\n",
"\n",
"#@markdown Select this checkbox to run the multimer model for a single sequence.\n",
"#@markdown For proteins that are monomeric in their native form, or for very \n",
"#@markdown large single chains you may get better accuracy and memory efficiency\n",
"#@markdown by using the multimer model.\n",
"#@markdown \n",
"#@markdown \n",
"#@markdown Due to improved memory efficiency the multimer model has a maximum\n",
"#@markdown limit of 3400 residues, while the monomer model has a limit of 2500\n",
"#@markdown residues.\n",
"\n",
"use_multimer_model_for_monomers = False #@param {type:\"boolean\"}\n",
"\n",
"# Validate the input sequences.\n",
"sequences = notebook_utils.clean_and_validate_input_sequences(\n",
" input_sequences=input_sequences,\n",
" min_length=MIN_SINGLE_SEQUENCE_LENGTH,\n",
" max_length=MAX_SINGLE_SEQUENCE_LENGTH,\n",
" max_multimer_length=MAX_MULTIMER_LENGTH)"
" min_sequence_length=MIN_PER_SEQUENCE_LENGTH,\n",
" max_sequence_length=MAX_PER_SEQUENCE_LENGTH)\n",
"\n",
"if len(sequences) == 1:\n",
" if use_multimer_model_for_monomers:\n",
" print('Using the multimer model for single-chain, as requested.')\n",
" model_type_to_use = ModelType.MULTIMER\n",
" else:\n",
" print('Using the single-chain model.')\n",
" model_type_to_use = ModelType.MONOMER\n",
"else:\n",
" print(f'Using the multimer model with {len(sequences)} sequences.')\n",
" model_type_to_use = ModelType.MULTIMER\n",
"\n",
"# Check whether total length exceeds limit.\n",
"total_sequence_length = sum([len(seq) for seq in sequences])\n",
"if total_sequence_length \u003e MAX_LENGTH:\n",
" raise ValueError('The total sequence length is too long: '\n",
" f'{total_sequence_length}, while the maximum is '\n",
" f'{MAX_LENGTH}.')\n",
"\n",
"# Check whether we exceed the monomer limit.\n",
"if model_type_to_use == ModelType.MONOMER:\n",
" if len(sequences[0]) \u003e MAX_MONOMER_MODEL_LENGTH:\n",
" raise ValueError(\n",
" f'Input sequence is too long: {len(sequences[0])} amino acids, while '\n",
" f'the maximum for the monomer model is {MAX_MONOMER_MODEL_LENGTH}. You may '\n",
" 'be able to run this sequence with the multimer model by selecting the '\n",
" 'use_multimer_model_for_monomers checkbox above.')\n",
" \n",
"if total_sequence_length \u003e MAX_VALIDATED_LENGTH:\n",
" print('WARNING: The accuracy of the system has not been fully validated '\n",
" 'above 3000 residues, and you may experience long running times or '\n",
" f'run out of memory. Total sequence length is {total_sequence_length} '\n",
" 'residues.')\n"
]
},
{
@@ -263,9 +331,9 @@
"#@title 4. Search against genetic databases\n",
"\n",
"#@markdown Once this cell has been executed, you will see\n",
"#@markdown statistics about the multiple sequence alignment \n",
"#@markdown (MSA) that will be used by AlphaFold. In particular, \n",
"#@markdown you’ll see how well each residue is covered by similar \n",
"#@markdown statistics about the multiple sequence alignment\n",
"#@markdown (MSA) that will be used by AlphaFold. In particular,\n",
"#@markdown you’ll see how well each residue is covered by similar\n",
"#@markdown sequences in the MSA.\n",
"\n",
"# --- Python imports ---\n",
@@ -308,7 +376,7 @@
" (90, 100, '#0053D6')]\n",
"\n",
"# --- Find the closest source ---\n",
"test_url_pattern = 'https://storage.googleapis.com/alphafold-colab{:s}/latest/uniref90_2021_03.fasta.1'\n",
"test_url_pattern = 'https://storage.googleapis.com/alphafold-colab{:s}/latest/uniref90_2022_01.fasta.1'\n",
"ex = futures.ThreadPoolExecutor(3)\n",
"def fetch(source):\n",
" request.urlretrieve(test_url_pattern.format(source))\n",
@@ -325,27 +393,27 @@
"# The z_value is the number of sequences in a database.\n",
"MSA_DATABASES = [\n",
" {'db_name': 'uniref90',\n",
" 'db_path': f'{DB_ROOT_PATH}uniref90_2021_03.fasta',\n",
" 'num_streamed_chunks': 59,\n",
" 'z_value': 135_301_051},\n",
" 'db_path': f'{DB_ROOT_PATH}uniref90_2022_01.fasta',\n",
" 'num_streamed_chunks': 62,\n",
" 'z_value': 144_113_457},\n",
" {'db_name': 'smallbfd',\n",
" 'db_path': f'{DB_ROOT_PATH}bfd-first_non_consensus_sequences.fasta',\n",
" 'num_streamed_chunks': 17,\n",
" 'z_value': 65_984_053},\n",
" {'db_name': 'mgnify',\n",
" 'db_path': f'{DB_ROOT_PATH}mgy_clusters_2019_05.fasta',\n",
" 'num_streamed_chunks': 71,\n",
" 'z_value': 304_820_129},\n",
" 'db_path': f'{DB_ROOT_PATH}mgy_clusters_2022_05.fasta',\n",
" 'num_streamed_chunks': 120,\n",
" 'z_value': 623_796_864},\n",
"]\n",
"\n",
"# Search UniProt and construct the all_seq features only for heteromers, not homomers.\n",
"if model_type_to_use == notebook_utils.ModelType.MULTIMER and len(set(sequences)) \u003e 1:\n",
"if model_type_to_use == ModelType.MULTIMER and len(set(sequences)) \u003e 1:\n",
" MSA_DATABASES.extend([\n",
" # Swiss-Prot and TrEMBL are concatenated together as UniProt.\n",
" {'db_name': 'uniprot',\n",
" 'db_path': f'{DB_ROOT_PATH}uniprot_2021_03.fasta',\n",
" 'num_streamed_chunks': 98,\n",
" 'z_value': 219_174_961 + 565_254},\n",
" 'db_path': f'{DB_ROOT_PATH}uniprot_2021_04.fasta',\n",
" 'num_streamed_chunks': 101,\n",
" 'z_value': 225_013_025 + 565_928},\n",
" ])\n",
"\n",
"TOTAL_JACKHMMER_CHUNKS = sum([cfg['num_streamed_chunks'] for cfg in MSA_DATABASES])\n",
@@ -426,7 +494,7 @@
" num_templates=0, num_res=len(sequence)))\n",
"\n",
" # Construct the all_seq features only for heteromers, not homomers.\n",
" if model_type_to_use == notebook_utils.ModelType.MULTIMER and len(set(sequences)) \u003e 1:\n",
" if model_type_to_use == ModelType.MULTIMER and len(set(sequences)) \u003e 1:\n",
" valid_feats = msa_pairing.MSA_FEATURES + (\n",
" 'msa_species_identifiers',\n",
" )\n",
@@ -439,10 +507,10 @@
"\n",
"\n",
"# Do further feature post-processing depending on the model type.\n",
"if model_type_to_use == notebook_utils.ModelType.MONOMER:\n",
"if model_type_to_use == ModelType.MONOMER:\n",
" np_example = features_for_chain[protein.PDB_CHAIN_IDS[0]]\n",
"\n",
"elif model_type_to_use == notebook_utils.ModelType.MULTIMER:\n",
"elif model_type_to_use == ModelType.MULTIMER:\n",
" all_chain_features = {}\n",
" for chain_id, chain_features in features_for_chain.items():\n",
" all_chain_features[chain_id] = pipeline_multimer.convert_monomer_features(\n",
@@ -484,10 +552,18 @@
"\n",
"relax_use_gpu = False #@param {type:\"boolean\"}\n",
"\n",
"\n",
"#@markdown The multimer model will continue recycling until the predictions stop\n",
"#@markdown changing, up to the limit set here. For higher accuracy, at the \n",
"#@markdown potential cost of longer inference times, set this to 20.\n",
"\n",
"multimer_model_max_num_recycles = 3 #@param {type:\"integer\"}\n",
"\n",
"\n",
"# --- Run the model ---\n",
"if model_type_to_use == notebook_utils.ModelType.MONOMER:\n",
"if model_type_to_use == ModelType.MONOMER:\n",
" model_names = config.MODEL_PRESETS['monomer'] + ('model_2_ptm',)\n",
"elif model_type_to_use == notebook_utils.ModelType.MULTIMER:\n",
"elif model_type_to_use == ModelType.MULTIMER:\n",
" model_names = config.MODEL_PRESETS['multimer']\n",
"\n",
"output_dir = 'prediction'\n",
@@ -503,10 +579,16 @@
" pbar.set_description(f'Running {model_name}')\n",
"\n",
" cfg = config.model_config(model_name)\n",
" if model_type_to_use == notebook_utils.ModelType.MONOMER:\n",
"\n",
" if model_type_to_use == ModelType.MONOMER:\n",
" cfg.data.eval.num_ensemble = 1\n",
" elif model_type_to_use == notebook_utils.ModelType.MULTIMER:\n",
" elif model_type_to_use == ModelType.MULTIMER:\n",
" cfg.model.num_ensemble_eval = 1\n",
"\n",
" if model_type_to_use == ModelType.MULTIMER:\n",
" cfg.model.num_recycle = multimer_model_max_num_recycles\n",
" cfg.model.recycle_early_stop_tolerance = 0.5\n",
"\n",
" params = data.get_model_haiku_params(model_name, './alphafold/data')\n",
" model_runner = model.RunModel(cfg, params)\n",
" processed_feature_dict = model_runner.process_features(np_example, random_seed=0)\n",
@@ -514,7 +596,7 @@
"\n",
" mean_plddt = prediction['plddt'].mean()\n",
"\n",
" if model_type_to_use == notebook_utils.ModelType.MONOMER:\n",
" if model_type_to_use == ModelType.MONOMER:\n",
" if 'predicted_aligned_error' in prediction:\n",
" pae_outputs[model_name] = (prediction['predicted_aligned_error'],\n",
" prediction['max_predicted_aligned_error'])\n",
@@ -523,7 +605,7 @@
" # should never get selected.\n",
" ranking_confidences[model_name] = prediction['ranking_confidence']\n",
" plddts[model_name] = prediction['plddt']\n",
" elif model_type_to_use == notebook_utils.ModelType.MULTIMER:\n",
" elif model_type_to_use == ModelType.MULTIMER:\n",
" # Multimer models are sorted by pTM+ipTM.\n",
" ranking_confidences[model_name] = prediction['ranking_confidence']\n",
" plddts[model_name] = prediction['plddt']\n",
@@ -538,7 +620,7 @@
" prediction,\n",
" b_factors=b_factors,\n",
" remove_leading_feature_dimension=(\n",
" model_type_to_use == notebook_utils.ModelType.MONOMER))\n",
" model_type_to_use == ModelType.MONOMER))\n",
" unrelaxed_proteins[model_name] = unrelaxed_protein\n",
"\n",
" # Delete unused outputs to save memory.\n",
@@ -611,7 +693,7 @@
" return plt\n",
"\n",
"# Show the structure coloured by chain if the multimer model has been used.\n",
"if model_type_to_use == notebook_utils.ModelType.MULTIMER:\n",
"if model_type_to_use == ModelType.MULTIMER:\n",
" multichain_view = py3Dmol.view(width=800, height=600)\n",
" multichain_view.addModelsAsFrames(to_visualize_pdb)\n",
" multichain_style = {'cartoon': {'colorscheme': 'chain'}}\n",
@@ -785,9 +867,9 @@
"## Mirrored Databases\n",
"\n",
"The following databases have been mirrored by DeepMind, and are available with reference to the following:\n",
"* UniProt: v2021\\_03 (unmodified), by The UniProt Consortium, available under a [Creative Commons Attribution-NoDerivatives 4.0 International License](http://creativecommons.org/licenses/by-nd/4.0/).\n",
"* UniRef90: v2021\\_03 (unmodified), by The UniProt Consortium, available under a [Creative Commons Attribution-NoDerivatives 4.0 International License](http://creativecommons.org/licenses/by-nd/4.0/).\n",
"* MGnify: v2019\\_05 (unmodified), by Mitchell AL et al., available free of all copyright restrictions and made fully and freely available for both non-commercial and commercial use under [CC0 1.0 Universal (CC0 1.0) Public Domain Dedication](https://creativecommons.org/publicdomain/zero/1.0/).\n",
"* UniProt: v2021\\_04 (unmodified), by The UniProt Consortium, available under a [Creative Commons Attribution-NoDerivatives 4.0 International License](http://creativecommons.org/licenses/by-nd/4.0/).\n",
"* UniRef90: v2022\\_01 (unmodified), by The UniProt Consortium, available under a [Creative Commons Attribution-NoDerivatives 4.0 International License](http://creativecommons.org/licenses/by-nd/4.0/).\n",
"* MGnify: v2022\\_05 (unmodified), by Mitchell AL et al., available free of all copyright restrictions and made fully and freely available for both non-commercial and commercial use under [CC0 1.0 Universal (CC0 1.0) Public Domain Dedication](https://creativecommons.org/publicdomain/zero/1.0/).\n",
"* BFD: (modified), by Steinegger M. and Söding J., modified by DeepMind, available under a [Creative Commons Attribution-ShareAlike 4.0 International License](https://creativecommons.org/licenses/by/4.0/). See the Methods section of the [AlphaFold proteome paper](https://www.nature.com/articles/s41586-021-03828-1) for details."
]
}
@@ -795,7 +877,6 @@
"metadata": {
"accelerator": "GPU",
"colab": {
"collapsed_sections": [],
"name": "AlphaFold.ipynb",
"private_outputs": true,
"provenance": []
......
@@ -73,7 +73,7 @@ flags.DEFINE_string('bfd_database_path', None, 'Path to the BFD '
'database for use by HHblits.')
flags.DEFINE_string('small_bfd_database_path', None, 'Path to the small '
'version of BFD used with the "reduced_dbs" preset.')
flags.DEFINE_string('uniclust30_database_path', None, 'Path to the Uniclust30 '
flags.DEFINE_string('uniref30_database_path', None, 'Path to the UniRef30 '
'database for use by HHblits.')
flags.DEFINE_string('uniprot_database_path', None, 'Path to the Uniprot '
'database for use by JackHMMer.')
@@ -181,6 +181,7 @@ def predict_structure(
unrelaxed_pdbs = {}
relaxed_pdbs = {}
relax_metrics = {}
ranking_confidences = {}
# Run the models.
@@ -239,7 +240,12 @@ def predict_structure(
if amber_relaxer:
# Relax the prediction.
t_0 = time.time()
relaxed_pdb_str, _, _ = amber_relaxer.process(prot=unrelaxed_protein)
relaxed_pdb_str, _, violations = amber_relaxer.process(
prot=unrelaxed_protein)
relax_metrics[model_name] = {
'remaining_violations': violations,
'remaining_violations_count': sum(violations)
}
timings[f'relax_{model_name}'] = time.time() - t_0
relaxed_pdbs[model_name] = relaxed_pdb_str
@@ -273,6 +279,10 @@ def predict_structure(
timings_output_path = os.path.join(output_dir, 'timings.json')
with open(timings_output_path, 'w') as f:
f.write(json.dumps(timings, indent=4))
if amber_relaxer:
relax_metrics_path = os.path.join(output_dir, 'relax_metrics.json')
with open(relax_metrics_path, 'w') as f:
f.write(json.dumps(relax_metrics, indent=4))
def main(argv):
@@ -290,7 +300,7 @@ def main(argv):
should_be_set=use_small_bfd)
_check_flag('bfd_database_path', 'db_preset',
should_be_set=not use_small_bfd)
_check_flag('uniclust30_database_path', 'db_preset',
_check_flag('uniref30_database_path', 'db_preset',
should_be_set=not use_small_bfd)
run_multimer_system = 'multimer' in FLAGS.model_preset
@@ -341,7 +351,7 @@ def main(argv):
uniref90_database_path=FLAGS.uniref90_database_path,
mgnify_database_path=FLAGS.mgnify_database_path,
bfd_database_path=FLAGS.bfd_database_path,
uniclust30_database_path=FLAGS.uniclust30_database_path,
uniref30_database_path=FLAGS.uniref30_database_path,
small_bfd_database_path=FLAGS.small_bfd_database_path,
template_searcher=template_searcher,
template_featurizer=template_featurizer,
......
@@ -14,6 +14,7 @@
"""Tests for run_alphafold."""
import json
import os
from absl.testing import absltest
@@ -57,7 +58,7 @@ class RunAlphafoldTest(parameterized.TestCase):
'max_predicted_aligned_error': np.array(0.),
}
model_runner_mock.multimer_mode = False
amber_relaxer_mock.process.return_value = ('RELAXED', None, None)
amber_relaxer_mock.process.return_value = ('RELAXED', None, [1., 0., 0.])
out_dir = self.create_tempdir().full_path
fasta_path = os.path.join(out_dir, 'target.fasta')
@@ -85,7 +86,12 @@ class RunAlphafoldTest(parameterized.TestCase):
'result_model1.pkl', 'timings.json', 'unrelaxed_model1.pdb',
]
if do_relax:
expected_files.append('relaxed_model1.pdb')
expected_files.extend(['relaxed_model1.pdb', 'relax_metrics.json'])
with open(os.path.join(out_dir, 'test', 'relax_metrics.json')) as f:
relax_metrics = json.loads(f.read())
self.assertDictEqual({'model1': {'remaining_violations': [1.0, 0.0, 0.0],
'remaining_violations_count': 1.0}},
relax_metrics)
self.assertCountEqual(expected_files, target_output_files)
# Check that pLDDT is set in the B-factor column.
......
@@ -59,8 +59,8 @@ bash "${SCRIPT_DIR}/download_pdb70.sh" "${DOWNLOAD_DIR}"
echo "Downloading PDB mmCIF files..."
bash "${SCRIPT_DIR}/download_pdb_mmcif.sh" "${DOWNLOAD_DIR}"
echo "Downloading Uniclust30..."
bash "${SCRIPT_DIR}/download_uniclust30.sh" "${DOWNLOAD_DIR}"
echo "Downloading Uniref30..."
bash "${SCRIPT_DIR}/download_uniref30.sh" "${DOWNLOAD_DIR}"
echo "Downloading Uniref90..."
bash "${SCRIPT_DIR}/download_uniref90.sh" "${DOWNLOAD_DIR}"
......
@@ -31,7 +31,7 @@ fi
DOWNLOAD_DIR="$1"
ROOT_DIR="${DOWNLOAD_DIR}/params"
SOURCE_URL="https://storage.googleapis.com/alphafold/alphafold_params_2022-03-02.tar"
SOURCE_URL="https://storage.googleapis.com/alphafold/alphafold_params_2022-12-06.tar"
BASENAME=$(basename "${SOURCE_URL}")
mkdir --parents "${ROOT_DIR}"
......
@@ -32,8 +32,8 @@ fi
DOWNLOAD_DIR="$1"
ROOT_DIR="${DOWNLOAD_DIR}/mgnify"
# Mirror of:
# ftp://ftp.ebi.ac.uk/pub/databases/metagenomics/peptide_database/2018_12/mgy_clusters.fa.gz
SOURCE_URL="https://storage.googleapis.com/alphafold-databases/casp14_versions/mgy_clusters_2018_12.fa.gz"
# ftp://ftp.ebi.ac.uk/pub/databases/metagenomics/peptide_database/2022_05/mgy_clusters.fa.gz
SOURCE_URL="https://storage.googleapis.com/alphafold-databases/v2.3/mgy_clusters_2022_05.fa.gz"
BASENAME=$(basename "${SOURCE_URL}")
mkdir --parents "${ROOT_DIR}"
......
@@ -36,3 +36,7 @@ BASENAME=$(basename "${SOURCE_URL}")
mkdir --parents "${ROOT_DIR}"
aria2c "${SOURCE_URL}" --dir="${ROOT_DIR}"
# Keep only protein sequences.
grep --after-context=1 --no-group-separator '>.* mol:protein' "${ROOT_DIR}/pdb_seqres.txt" > "${ROOT_DIR}/pdb_seqres_filtered.txt"
mv "${ROOT_DIR}/pdb_seqres_filtered.txt" "${ROOT_DIR}/pdb_seqres.txt"
@@ -30,10 +30,10 @@ if ! command -v aria2c &> /dev/null ; then
fi
DOWNLOAD_DIR="$1"
ROOT_DIR="${DOWNLOAD_DIR}/uniclust30"
ROOT_DIR="${DOWNLOAD_DIR}/uniref30"
# Mirror of:
# http://wwwuser.gwdg.de/~compbiol/uniclust/2018_08/uniclust30_2018_08_hhsuite.tar.gz
SOURCE_URL="https://storage.googleapis.com/alphafold-databases/casp14_versions/uniclust30_2018_08_hhsuite.tar.gz"
# https://wwwuser.gwdg.de/~compbiol/uniclust/2021_03/UniRef30_2021_03.tar.gz
SOURCE_URL="https://storage.googleapis.com/alphafold-databases/v2.3/UniRef30_2021_03.tar.gz"
BASENAME=$(basename "${SOURCE_URL}")
mkdir --parents "${ROOT_DIR}"
......
@@ -18,7 +18,7 @@ from setuptools import setup
setup(
name='alphafold',
version='2.2.4',
version='2.3.0',
description='An implementation of the inference pipeline of AlphaFold v2.0. '
'This is a completely new model that was entered as AlphaFold2 in CASP14 '
'and published in Nature.',
......