Commit 9b18d6a9 authored by Augustin Zidek

Release code for v2.3.0

PiperOrigin-RevId: 494507694
parent 4494af84
# AlphaFold v2.3.0

This technical note describes updates in the code and model weights that were
made to produce AlphaFold v2.3.0, including updated training data.

We have fine-tuned new AlphaFold-Multimer weights using identical model
architecture but a new training cutoff of 2021-09-30. Previously released
versions of AlphaFold and AlphaFold-Multimer were trained using PDB structures
with a release date before 2018-04-30, a cutoff date chosen to coincide with the
start of the 2018 CASP13 assessment. The new training cutoff represents ~30%
more data to train AlphaFold and more importantly includes much more data on
large protein complexes. The new training cutoff includes 4× the number of
electron microscopy structures and in aggregate twice the number of large
structures (more than 2,000 residues)[^1]. Due to the significant increase in
the number of large structures, we are also able to increase the size of
training crops (subsets of the structure used to train AlphaFold) from 384 to
640 residues. These new AlphaFold-Multimer models are expected to be
substantially more accurate on large protein complexes even though we use the
same model architecture and training methodology as described in our previously
released AlphaFold-Multimer paper.

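The crop-size change can be pictured with a toy helper that samples a contiguous training crop. This is a sketch under stated assumptions: `sample_crop` is a hypothetical name, and AlphaFold's real training pipeline uses more involved (e.g. multi-chain, spatially aware) cropping — the shared idea is just training on a random subset of a large structure.

```python
import random

def sample_crop(num_res, crop_size=640, rng=None):
    """Return a half-open (start, end) window of at most crop_size residues.

    Illustrative only: the real pipeline's cropping is more sophisticated,
    but the principle is training on a random contiguous subset.
    """
    rng = rng or random.Random()
    if num_res <= crop_size:
        return 0, num_res  # structure fits entirely in one crop
    start = rng.randint(0, num_res - crop_size)
    return start, start + crop_size
```

With a 640-residue crop, a 2,000-residue complex yields a window covering about a third of the structure, versus under a fifth at the old 384-residue setting.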
These models were initially developed in response to a request from the CASP
organizers to better understand baselines for the progress of structure
prediction in CASP15, and because of the significant increase in accuracy for
large targets, we are making them available as the default multimer models.
Since they were developed as baselines, we have emphasized minimal changes to
our previous AlphaFold-Multimer system while accommodating larger complexes.
In particular, we increase the number of chains used at training time from 8 to
20 and increase the maximum number of MSA sequences from 1,152 to 2,048 for 3 of
the 5 AlphaFold-Multimer models.

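As a rough sketch of what raising the MSA cap means in practice, the helper below subsamples an alignment down to a maximum number of sequences while always keeping the query row. The function name and the uniform sampling are illustrative assumptions, not the pipeline's actual selection logic.

```python
import random

def subsample_msa(msa_rows, max_seqs=2048, rng=None):
    """Cap an MSA at max_seqs rows, always keeping the query (row 0).

    Illustrative only: uniform sampling is an assumption; the real
    pipeline uses its own selection scheme.
    """
    rng = rng or random.Random()
    if len(msa_rows) <= max_seqs:
        return list(msa_rows)
    rest = rng.sample(msa_rows[1:], max_seqs - 1)
    return [msa_rows[0]] + rest
```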
For the CASP15 baseline, we also used somewhat more expensive inference settings
that have been found externally to improve AlphaFold accuracy. We increase the
number of seeds per model to 20[^2] and increase the maximum number of
recyclings to 20 with early stopping[^3]. Increasing the number of seeds to 20
is recommended for very large or difficult targets but is not the default due to
increased computational time.

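The recycling-with-early-stopping idea can be sketched as follows. Everything here is a stand-in: `predict_fn` is a hypothetical model forward pass returning coordinates, and the mean-distance stopping rule mirrors the spirit, not the exact implementation, of the tolerance check.

```python
import math

def recycle_with_early_stopping(predict_fn, inputs, max_recycles=20, tol=0.5):
    """Repeatedly feed the model's output back in ("recycling"), stopping
    once coordinates move less than `tol` angstroms on average between
    iterations. Sketch only: `predict_fn` is a hypothetical forward pass
    returning a list of (x, y, z) tuples.
    """
    prev = None
    for step in range(1, max_recycles + 1):
        coords = predict_fn(inputs, prev)
        if prev is not None:
            delta = sum(math.dist(a, b) for a, b in zip(coords, prev)) / len(coords)
            if delta < tol:
                return coords, step  # converged: stop recycling early
        prev = coords
    return prev, max_recycles
```

Because easy targets converge in a few iterations, raising the cap to 20 recycles mostly costs extra compute only on the hard targets that benefit from it.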
Overall, we expect these new models to be the preferred models whenever the
stoichiometry of the complex is known, including known monomeric structures. In
cases where the stoichiometry is unknown, such as in genome-scale prediction, it
is likely that single chain AlphaFold will be more accurate on average unless
the chain has several thousand residues.

The predicted structures used for the CASP15 baselines are available
[here](https://github.com/deepmind/alphafold/blob/main/docs/casp15_predictions.zip).

[^1]: wwPDB Consortium. "Protein Data Bank: the single global archive for 3D
macromolecular structure data." Nucleic Acids Res. 47, D520–D528 (2018).

[^2]: Johansson-Åkhe, I. & Wallner, B. "Improving peptide-protein docking with
AlphaFold-Multimer using forced sampling." Front. Bioinform. 2, 959160 (2022).

[^3]: Gao, M. et al. "AF2Complex predicts direct physical interactions in
multimeric proteins with deep learning." Nat. Commun. 13, 1–13 (2022).
@@ -8,15 +8,15 @@
"source": [
"# AlphaFold Colab\n",
"\n",
"This Colab notebook allows you to easily predict the structure of a protein using a slightly simplified version of [AlphaFold v2.2.4](https://doi.org/10.1038/s41586-021-03819-2). \n",
"This Colab notebook allows you to easily predict the structure of a protein using a slightly simplified version of [AlphaFold v2.3.0](https://doi.org/10.1038/s41586-021-03819-2). \n",
"\n",
"**Differences to AlphaFold v2.2.4**\n",
"**Differences to AlphaFold v2.3.0**\n",
"\n",
"In comparison to AlphaFold v2.2.4, this Colab notebook uses **no templates (homologous structures)** and a selected portion of the [BFD database](https://bfd.mmseqs.com/). We have validated these changes on several thousand recent PDB structures. While accuracy will be near-identical to the full AlphaFold system on many targets, a small fraction have a large drop in accuracy due to the smaller MSA and lack of templates. For best reliability, we recommend instead using the [full open source AlphaFold](https://github.com/deepmind/alphafold/), or the [AlphaFold Protein Structure Database](https://alphafold.ebi.ac.uk/).\n",
"In comparison to AlphaFold v2.3.0, this Colab notebook uses **no templates (homologous structures)** and a selected portion of the [BFD database](https://bfd.mmseqs.com/). We have validated these changes on several thousand recent PDB structures. While accuracy will be near-identical to the full AlphaFold system on many targets, a small fraction have a large drop in accuracy due to the smaller MSA and lack of templates. For best reliability, we recommend instead using the [full open source AlphaFold](https://github.com/deepmind/alphafold/), or the [AlphaFold Protein Structure Database](https://alphafold.ebi.ac.uk/).\n",
"\n",
"**This Colab has a small drop in average accuracy for multimers compared to a local AlphaFold installation; for full multimer accuracy it is highly recommended to run [AlphaFold locally](https://github.com/deepmind/alphafold#running-alphafold).** Moreover, AlphaFold-Multimer requires an MSA search for every unique sequence in the complex and is therefore substantially slower. If your notebook times out due to the slow multimer MSA search, we recommend either using Colab Pro or running AlphaFold locally.\n",
"\n",
"Please note that this Colab notebook is provided as an early-access prototype and is not a finished product. It is provided for theoretical modelling only and caution should be exercised in its use. \n",
"Please note that this Colab notebook is provided for theoretical modelling only and caution should be exercised in its use. \n",
"\n",
"The **PAE file format** has been updated to match AFDB. Please see the [AFDB FAQ](https://alphafold.ebi.ac.uk/faq/#faq-7) for a description of the new format.\n",
"\n",
@@ -67,11 +67,11 @@
"source": [
"#@title 1. Install third-party software\n",
"\n",
"#@markdown Please execute this cell by pressing the _Play_ button \n",
"#@markdown on the left to download and import third-party software \n",
"#@markdown Please execute this cell by pressing the _Play_ button\n",
"#@markdown on the left to download and import third-party software\n",
"#@markdown in this Colab notebook. (See the [acknowledgements](https://github.com/deepmind/alphafold/#acknowledgements) in our readme.)\n",
"\n",
"#@markdown **Note**: This installs the software on the Colab \n",
"#@markdown **Note**: This installs the software on the Colab\n",
"#@markdown notebook in the cloud and not on your computer.\n",
"\n",
"from IPython.utils import io\n",
@@ -135,12 +135,11 @@
"source": [
"#@title 2. Download AlphaFold\n",
"\n",
"#@markdown Please execute this cell by pressing the *Play* button on \n",
"#@markdown Please execute this cell by pressing the *Play* button on\n",
"#@markdown the left.\n",
"\n",
"GIT_REPO = 'https://github.com/deepmind/alphafold'\n",
"\n",
"SOURCE_URL = 'https://storage.googleapis.com/alphafold/alphafold_params_colab_2022-03-02.tar'\n",
"SOURCE_URL = 'https://storage.googleapis.com/alphafold/alphafold_params_colab_2022-12-06.tar'\n",
"PARAMS_DIR = './alphafold/data/params'\n",
"PARAMS_PATH = os.path.join(PARAMS_DIR, os.path.basename(SOURCE_URL))\n",
"\n",
@@ -167,6 +166,7 @@
" %shell mkdir -p /opt/conda/lib/python3.8/site-packages/alphafold/common/\n",
" %shell cp -f /content/stereo_chemical_props.txt /opt/conda/lib/python3.8/site-packages/alphafold/common/\n",
"\n",
" # Load parameters\n",
" %shell mkdir --parents \"{PARAMS_DIR}\"\n",
" %shell wget -O \"{PARAMS_PATH}\" \"{SOURCE_URL}\"\n",
" pbar.update(27)\n",
@@ -222,10 +222,17 @@
"source": [
"#@title 3. Enter the amino acid sequence(s) to fold ⬇️\n",
"#@markdown Enter the amino acid sequence(s) to fold:\n",
"#@markdown * If you enter only a single sequence, the monomer model will be used.\n",
"#@markdown * If you enter only a single sequence, the monomer model will be \n",
"#@markdown used (unless you override this below).\n",
"#@markdown * If you enter multiple sequences, the multimer model will be used.\n",
"\n",
"from alphafold.notebooks import notebook_utils\n",
"import enum\n",
"\n",
"@enum.unique\n",
"class ModelType(enum.Enum):\n",
" MONOMER = 0\n",
" MULTIMER = 1\n",
"\n",
"sequence_1 = 'MAAHKGAEHHHKAAEHHEQAAKHHHAAAEHHEKGEHEQAAHHADTAYAHHKHAEEHAAQAAKHDAEHHAPKPH' #@param {type:\"string\"}\n",
"sequence_2 = '' #@param {type:\"string\"}\n",
@@ -235,20 +242,81 @@
"sequence_6 = '' #@param {type:\"string\"}\n",
"sequence_7 = '' #@param {type:\"string\"}\n",
"sequence_8 = '' #@param {type:\"string\"}\n",
"\n",
"input_sequences = (sequence_1, sequence_2, sequence_3, sequence_4,\n",
" sequence_5, sequence_6, sequence_7, sequence_8)\n",
"\n",
"MIN_SINGLE_SEQUENCE_LENGTH = 16\n",
"MAX_SINGLE_SEQUENCE_LENGTH = 2500\n",
"MAX_MULTIMER_LENGTH = 2500\n",
"\n",
"# Validate the input.\n",
"sequences, model_type_to_use = notebook_utils.validate_input(\n",
"sequence_9 = '' #@param {type:\"string\"}\n",
"sequence_10 = '' #@param {type:\"string\"}\n",
"sequence_11 = '' #@param {type:\"string\"}\n",
"sequence_12 = '' #@param {type:\"string\"}\n",
"sequence_13 = '' #@param {type:\"string\"}\n",
"sequence_14 = '' #@param {type:\"string\"}\n",
"sequence_15 = '' #@param {type:\"string\"}\n",
"sequence_16 = '' #@param {type:\"string\"}\n",
"sequence_17 = '' #@param {type:\"string\"}\n",
"sequence_18 = '' #@param {type:\"string\"}\n",
"sequence_19 = '' #@param {type:\"string\"}\n",
"sequence_20 = '' #@param {type:\"string\"}\n",
"\n",
"input_sequences = (\n",
" sequence_1, sequence_2, sequence_3, sequence_4, sequence_5, \n",
" sequence_6, sequence_7, sequence_8, sequence_9, sequence_10,\n",
" sequence_11, sequence_12, sequence_13, sequence_14, sequence_15, \n",
" sequence_16, sequence_17, sequence_18, sequence_19, sequence_20)\n",
"\n",
"MIN_PER_SEQUENCE_LENGTH = 16\n",
"MAX_PER_SEQUENCE_LENGTH = 3400\n",
"MAX_MONOMER_MODEL_LENGTH = 2500\n",
"MAX_LENGTH = 3400\n",
"MAX_VALIDATED_LENGTH = 3000\n",
"\n",
"#@markdown Select this checkbox to run the multimer model for a single sequence.\n",
"#@markdown For proteins that are monomeric in their native form, or for very \n",
"#@markdown large single chains you may get better accuracy and memory efficiency\n",
"#@markdown by using the multimer model.\n",
"#@markdown \n",
"#@markdown \n",
"#@markdown Due to improved memory efficiency the multimer model has a maximum\n",
"#@markdown limit of 3400 residues, while the monomer model has a limit of 2500\n",
"#@markdown residues.\n",
"\n",
"use_multimer_model_for_monomers = False #@param {type:\"boolean\"}\n",
"\n",
"# Validate the input sequences.\n",
"sequences = notebook_utils.clean_and_validate_input_sequences(\n",
" input_sequences=input_sequences,\n",
" min_length=MIN_SINGLE_SEQUENCE_LENGTH,\n",
" max_length=MAX_SINGLE_SEQUENCE_LENGTH,\n",
" max_multimer_length=MAX_MULTIMER_LENGTH)"
" min_sequence_length=MIN_PER_SEQUENCE_LENGTH,\n",
" max_sequence_length=MAX_PER_SEQUENCE_LENGTH)\n",
"\n",
"if len(sequences) == 1:\n",
" if use_multimer_model_for_monomers:\n",
" print('Using the multimer model for single-chain, as requested.')\n",
" model_type_to_use = ModelType.MULTIMER\n",
" else:\n",
" print('Using the single-chain model.')\n",
" model_type_to_use = ModelType.MONOMER\n",
"else:\n",
" print(f'Using the multimer model with {len(sequences)} sequences.')\n",
" model_type_to_use = ModelType.MULTIMER\n",
"\n",
"# Check whether total length exceeds limit.\n",
"total_sequence_length = sum([len(seq) for seq in sequences])\n",
"if total_sequence_length \u003e MAX_LENGTH:\n",
" raise ValueError('The total sequence length is too long: '\n",
" f'{total_sequence_length}, while the maximum is '\n",
" f'{MAX_LENGTH}.')\n",
"\n",
"# Check whether we exceed the monomer limit.\n",
"if model_type_to_use == ModelType.MONOMER:\n",
" if len(sequences[0]) \u003e MAX_MONOMER_MODEL_LENGTH:\n",
" raise ValueError(\n",
" f'Input sequence is too long: {len(sequences[0])} amino acids, while '\n",
" f'the maximum for the monomer model is {MAX_MONOMER_MODEL_LENGTH}. You may '\n",
" 'be able to run this sequence with the multimer model by selecting the '\n",
" 'use_multimer_model_for_monomers checkbox above.')\n",
" \n",
"if total_sequence_length \u003e MAX_VALIDATED_LENGTH:\n",
" print('WARNING: The accuracy of the system has not been fully validated '\n",
" 'above 3000 residues, and you may experience long running times or '\n",
" f'run out of memory. Total sequence length is {total_sequence_length} '\n",
" 'residues.')\n"
]
},
{
@@ -263,9 +331,9 @@
"#@title 4. Search against genetic databases\n",
"\n",
"#@markdown Once this cell has been executed, you will see\n",
"#@markdown statistics about the multiple sequence alignment \n",
"#@markdown (MSA) that will be used by AlphaFold. In particular, \n",
"#@markdown you’ll see how well each residue is covered by similar \n",
"#@markdown statistics about the multiple sequence alignment\n",
"#@markdown (MSA) that will be used by AlphaFold. In particular,\n",
"#@markdown you’ll see how well each residue is covered by similar\n",
"#@markdown sequences in the MSA.\n",
"\n",
"# --- Python imports ---\n",
@@ -308,7 +376,7 @@
" (90, 100, '#0053D6')]\n",
"\n",
"# --- Find the closest source ---\n",
"test_url_pattern = 'https://storage.googleapis.com/alphafold-colab{:s}/latest/uniref90_2021_03.fasta.1'\n",
"test_url_pattern = 'https://storage.googleapis.com/alphafold-colab{:s}/latest/uniref90_2022_01.fasta.1'\n",
"ex = futures.ThreadPoolExecutor(3)\n",
"def fetch(source):\n",
" request.urlretrieve(test_url_pattern.format(source))\n",
@@ -325,27 +393,27 @@
"# The z_value is the number of sequences in a database.\n",
"MSA_DATABASES = [\n",
" {'db_name': 'uniref90',\n",
" 'db_path': f'{DB_ROOT_PATH}uniref90_2021_03.fasta',\n",
" 'num_streamed_chunks': 59,\n",
" 'z_value': 135_301_051},\n",
" 'db_path': f'{DB_ROOT_PATH}uniref90_2022_01.fasta',\n",
" 'num_streamed_chunks': 62,\n",
" 'z_value': 144_113_457},\n",
" {'db_name': 'smallbfd',\n",
" 'db_path': f'{DB_ROOT_PATH}bfd-first_non_consensus_sequences.fasta',\n",
" 'num_streamed_chunks': 17,\n",
" 'z_value': 65_984_053},\n",
" {'db_name': 'mgnify',\n",
" 'db_path': f'{DB_ROOT_PATH}mgy_clusters_2019_05.fasta',\n",
" 'num_streamed_chunks': 71,\n",
" 'z_value': 304_820_129},\n",
" 'db_path': f'{DB_ROOT_PATH}mgy_clusters_2022_05.fasta',\n",
" 'num_streamed_chunks': 120,\n",
" 'z_value': 623_796_864},\n",
"]\n",
"\n",
"# Search UniProt and construct the all_seq features only for heteromers, not homomers.\n",
"if model_type_to_use == notebook_utils.ModelType.MULTIMER and len(set(sequences)) \u003e 1:\n",
"if model_type_to_use == ModelType.MULTIMER and len(set(sequences)) \u003e 1:\n",
" MSA_DATABASES.extend([\n",
" # Swiss-Prot and TrEMBL are concatenated together as UniProt.\n",
" {'db_name': 'uniprot',\n",
" 'db_path': f'{DB_ROOT_PATH}uniprot_2021_03.fasta',\n",
" 'num_streamed_chunks': 98,\n",
" 'z_value': 219_174_961 + 565_254},\n",
" 'db_path': f'{DB_ROOT_PATH}uniprot_2021_04.fasta',\n",
" 'num_streamed_chunks': 101,\n",
" 'z_value': 225_013_025 + 565_928},\n",
" ])\n",
"\n",
"TOTAL_JACKHMMER_CHUNKS = sum([cfg['num_streamed_chunks'] for cfg in MSA_DATABASES])\n",
@@ -426,7 +494,7 @@
" num_templates=0, num_res=len(sequence)))\n",
"\n",
" # Construct the all_seq features only for heteromers, not homomers.\n",
" if model_type_to_use == notebook_utils.ModelType.MULTIMER and len(set(sequences)) \u003e 1:\n",
" if model_type_to_use == ModelType.MULTIMER and len(set(sequences)) \u003e 1:\n",
" valid_feats = msa_pairing.MSA_FEATURES + (\n",
" 'msa_species_identifiers',\n",
" )\n",
@@ -439,10 +507,10 @@
"\n",
"\n",
"# Do further feature post-processing depending on the model type.\n",
"if model_type_to_use == notebook_utils.ModelType.MONOMER:\n",
"if model_type_to_use == ModelType.MONOMER:\n",
" np_example = features_for_chain[protein.PDB_CHAIN_IDS[0]]\n",
"\n",
"elif model_type_to_use == notebook_utils.ModelType.MULTIMER:\n",
"elif model_type_to_use == ModelType.MULTIMER:\n",
" all_chain_features = {}\n",
" for chain_id, chain_features in features_for_chain.items():\n",
" all_chain_features[chain_id] = pipeline_multimer.convert_monomer_features(\n",
@@ -484,10 +552,18 @@
"\n",
"relax_use_gpu = False #@param {type:\"boolean\"}\n",
"\n",
"\n",
"#@markdown The multimer model will continue recycling until the predictions stop\n",
"#@markdown changing, up to the limit set here. For higher accuracy, at the \n",
"#@markdown potential cost of longer inference times, set this to 20.\n",
"\n",
"multimer_model_max_num_recycles = 3 #@param {type:\"integer\"}\n",
"\n",
"\n",
"# --- Run the model ---\n",
"if model_type_to_use == notebook_utils.ModelType.MONOMER:\n",
"if model_type_to_use == ModelType.MONOMER:\n",
" model_names = config.MODEL_PRESETS['monomer'] + ('model_2_ptm',)\n",
"elif model_type_to_use == notebook_utils.ModelType.MULTIMER:\n",
"elif model_type_to_use == ModelType.MULTIMER:\n",
" model_names = config.MODEL_PRESETS['multimer']\n",
"\n",
"output_dir = 'prediction'\n",
@@ -503,10 +579,16 @@
" pbar.set_description(f'Running {model_name}')\n",
"\n",
" cfg = config.model_config(model_name)\n",
" if model_type_to_use == notebook_utils.ModelType.MONOMER:\n",
"\n",
" if model_type_to_use == ModelType.MONOMER:\n",
" cfg.data.eval.num_ensemble = 1\n",
" elif model_type_to_use == notebook_utils.ModelType.MULTIMER:\n",
" elif model_type_to_use == ModelType.MULTIMER:\n",
" cfg.model.num_ensemble_eval = 1\n",
"\n",
" if model_type_to_use == ModelType.MULTIMER:\n",
" cfg.model.num_recycle = multimer_model_max_num_recycles\n",
" cfg.model.recycle_early_stop_tolerance = 0.5\n",
"\n",
" params = data.get_model_haiku_params(model_name, './alphafold/data')\n",
" model_runner = model.RunModel(cfg, params)\n",
" processed_feature_dict = model_runner.process_features(np_example, random_seed=0)\n",
@@ -514,7 +596,7 @@
"\n",
" mean_plddt = prediction['plddt'].mean()\n",
"\n",
" if model_type_to_use == notebook_utils.ModelType.MONOMER:\n",
" if model_type_to_use == ModelType.MONOMER:\n",
" if 'predicted_aligned_error' in prediction:\n",
" pae_outputs[model_name] = (prediction['predicted_aligned_error'],\n",
" prediction['max_predicted_aligned_error'])\n",
@@ -523,7 +605,7 @@
" # should never get selected.\n",
" ranking_confidences[model_name] = prediction['ranking_confidence']\n",
" plddts[model_name] = prediction['plddt']\n",
" elif model_type_to_use == notebook_utils.ModelType.MULTIMER:\n",
" elif model_type_to_use == ModelType.MULTIMER:\n",
" # Multimer models are sorted by pTM+ipTM.\n",
" ranking_confidences[model_name] = prediction['ranking_confidence']\n",
" plddts[model_name] = prediction['plddt']\n",
@@ -538,7 +620,7 @@
" prediction,\n",
" b_factors=b_factors,\n",
" remove_leading_feature_dimension=(\n",
" model_type_to_use == notebook_utils.ModelType.MONOMER))\n",
" model_type_to_use == ModelType.MONOMER))\n",
" unrelaxed_proteins[model_name] = unrelaxed_protein\n",
"\n",
" # Delete unused outputs to save memory.\n",
@@ -611,7 +693,7 @@
" return plt\n",
"\n",
"# Show the structure coloured by chain if the multimer model has been used.\n",
"if model_type_to_use == notebook_utils.ModelType.MULTIMER:\n",
"if model_type_to_use == ModelType.MULTIMER:\n",
" multichain_view = py3Dmol.view(width=800, height=600)\n",
" multichain_view.addModelsAsFrames(to_visualize_pdb)\n",
" multichain_style = {'cartoon': {'colorscheme': 'chain'}}\n",
@@ -785,9 +867,9 @@
"## Mirrored Databases\n",
"\n",
"The following databases have been mirrored by DeepMind, and are available with reference to the following:\n",
"* UniProt: v2021\\_03 (unmodified), by The UniProt Consortium, available under a [Creative Commons Attribution-NoDerivatives 4.0 International License](http://creativecommons.org/licenses/by-nd/4.0/).\n",
"* UniRef90: v2021\\_03 (unmodified), by The UniProt Consortium, available under a [Creative Commons Attribution-NoDerivatives 4.0 International License](http://creativecommons.org/licenses/by-nd/4.0/).\n",
"* MGnify: v2019\\_05 (unmodified), by Mitchell AL et al., available free of all copyright restrictions and made fully and freely available for both non-commercial and commercial use under [CC0 1.0 Universal (CC0 1.0) Public Domain Dedication](https://creativecommons.org/publicdomain/zero/1.0/).\n",
"* UniProt: v2021\\_04 (unmodified), by The UniProt Consortium, available under a [Creative Commons Attribution-NoDerivatives 4.0 International License](http://creativecommons.org/licenses/by-nd/4.0/).\n",
"* UniRef90: v2022\\_01 (unmodified), by The UniProt Consortium, available under a [Creative Commons Attribution-NoDerivatives 4.0 International License](http://creativecommons.org/licenses/by-nd/4.0/).\n",
"* MGnify: v2022\\_05 (unmodified), by Mitchell AL et al., available free of all copyright restrictions and made fully and freely available for both non-commercial and commercial use under [CC0 1.0 Universal (CC0 1.0) Public Domain Dedication](https://creativecommons.org/publicdomain/zero/1.0/).\n",
"* BFD: (modified), by Steinegger M. and Söding J., modified by DeepMind, available under a [Creative Commons Attribution-ShareAlike 4.0 International License](https://creativecommons.org/licenses/by/4.0/). See the Methods section of the [AlphaFold proteome paper](https://www.nature.com/articles/s41586-021-03828-1) for details."
]
}
@@ -795,7 +877,6 @@
"metadata": {
"accelerator": "GPU",
"colab": {
"collapsed_sections": [],
"name": "AlphaFold.ipynb",
"private_outputs": true,
"provenance": []
......
@@ -73,7 +73,7 @@ flags.DEFINE_string('bfd_database_path', None, 'Path to the BFD '
'database for use by HHblits.')
flags.DEFINE_string('small_bfd_database_path', None, 'Path to the small '
'version of BFD used with the "reduced_dbs" preset.')
flags.DEFINE_string('uniclust30_database_path', None, 'Path to the Uniclust30 '
flags.DEFINE_string('uniref30_database_path', None, 'Path to the UniRef30 '
'database for use by HHblits.')
flags.DEFINE_string('uniprot_database_path', None, 'Path to the Uniprot '
'database for use by JackHMMer.')
@@ -181,6 +181,7 @@ def predict_structure(
unrelaxed_pdbs = {}
relaxed_pdbs = {}
relax_metrics = {}
ranking_confidences = {}
# Run the models.
@@ -239,7 +240,12 @@ def predict_structure(
if amber_relaxer:
# Relax the prediction.
t_0 = time.time()
relaxed_pdb_str, _, _ = amber_relaxer.process(prot=unrelaxed_protein)
relaxed_pdb_str, _, violations = amber_relaxer.process(
prot=unrelaxed_protein)
relax_metrics[model_name] = {
'remaining_violations': violations,
'remaining_violations_count': sum(violations)
}
timings[f'relax_{model_name}'] = time.time() - t_0
relaxed_pdbs[model_name] = relaxed_pdb_str
@@ -273,6 +279,10 @@ def predict_structure(
timings_output_path = os.path.join(output_dir, 'timings.json')
with open(timings_output_path, 'w') as f:
f.write(json.dumps(timings, indent=4))
if amber_relaxer:
relax_metrics_path = os.path.join(output_dir, 'relax_metrics.json')
with open(relax_metrics_path, 'w') as f:
f.write(json.dumps(relax_metrics, indent=4))
def main(argv):
@@ -290,7 +300,7 @@ def main(argv):
should_be_set=use_small_bfd)
_check_flag('bfd_database_path', 'db_preset',
should_be_set=not use_small_bfd)
_check_flag('uniclust30_database_path', 'db_preset',
_check_flag('uniref30_database_path', 'db_preset',
should_be_set=not use_small_bfd)
run_multimer_system = 'multimer' in FLAGS.model_preset
@@ -341,7 +351,7 @@ def main(argv):
uniref90_database_path=FLAGS.uniref90_database_path,
mgnify_database_path=FLAGS.mgnify_database_path,
bfd_database_path=FLAGS.bfd_database_path,
uniclust30_database_path=FLAGS.uniclust30_database_path,
uniref30_database_path=FLAGS.uniref30_database_path,
small_bfd_database_path=FLAGS.small_bfd_database_path,
template_searcher=template_searcher,
template_featurizer=template_featurizer,
......
@@ -14,6 +14,7 @@
"""Tests for run_alphafold."""
import json
import os
from absl.testing import absltest
@@ -57,7 +58,7 @@ class RunAlphafoldTest(parameterized.TestCase):
'max_predicted_aligned_error': np.array(0.),
}
model_runner_mock.multimer_mode = False
amber_relaxer_mock.process.return_value = ('RELAXED', None, None)
amber_relaxer_mock.process.return_value = ('RELAXED', None, [1., 0., 0.])
out_dir = self.create_tempdir().full_path
fasta_path = os.path.join(out_dir, 'target.fasta')
@@ -85,7 +86,12 @@ class RunAlphafoldTest(parameterized.TestCase):
'result_model1.pkl', 'timings.json', 'unrelaxed_model1.pdb',
]
if do_relax:
expected_files.append('relaxed_model1.pdb')
expected_files.extend(['relaxed_model1.pdb', 'relax_metrics.json'])
with open(os.path.join(out_dir, 'test', 'relax_metrics.json')) as f:
relax_metrics = json.loads(f.read())
self.assertDictEqual({'model1': {'remaining_violations': [1.0, 0.0, 0.0],
'remaining_violations_count': 1.0}},
relax_metrics)
self.assertCountEqual(expected_files, target_output_files)
# Check that pLDDT is set in the B-factor column.
......
@@ -59,8 +59,8 @@ bash "${SCRIPT_DIR}/download_pdb70.sh" "${DOWNLOAD_DIR}"
echo "Downloading PDB mmCIF files..."
bash "${SCRIPT_DIR}/download_pdb_mmcif.sh" "${DOWNLOAD_DIR}"
echo "Downloading Uniclust30..."
bash "${SCRIPT_DIR}/download_uniclust30.sh" "${DOWNLOAD_DIR}"
echo "Downloading Uniref30..."
bash "${SCRIPT_DIR}/download_uniref30.sh" "${DOWNLOAD_DIR}"
echo "Downloading Uniref90..."
bash "${SCRIPT_DIR}/download_uniref90.sh" "${DOWNLOAD_DIR}"
......
@@ -31,7 +31,7 @@ fi
DOWNLOAD_DIR="$1"
ROOT_DIR="${DOWNLOAD_DIR}/params"
SOURCE_URL="https://storage.googleapis.com/alphafold/alphafold_params_2022-03-02.tar"
SOURCE_URL="https://storage.googleapis.com/alphafold/alphafold_params_2022-12-06.tar"
BASENAME=$(basename "${SOURCE_URL}")
mkdir --parents "${ROOT_DIR}"
......
@@ -32,8 +32,8 @@ fi
DOWNLOAD_DIR="$1"
ROOT_DIR="${DOWNLOAD_DIR}/mgnify"
# Mirror of:
# ftp://ftp.ebi.ac.uk/pub/databases/metagenomics/peptide_database/2018_12/mgy_clusters.fa.gz
SOURCE_URL="https://storage.googleapis.com/alphafold-databases/casp14_versions/mgy_clusters_2018_12.fa.gz"
# ftp://ftp.ebi.ac.uk/pub/databases/metagenomics/peptide_database/2022_05/mgy_clusters.fa.gz
SOURCE_URL="https://storage.googleapis.com/alphafold-databases/v2.3/mgy_clusters_2022_05.fa.gz"
BASENAME=$(basename "${SOURCE_URL}")
mkdir --parents "${ROOT_DIR}"
......
@@ -36,3 +36,7 @@ BASENAME=$(basename "${SOURCE_URL}")
mkdir --parents "${ROOT_DIR}"
aria2c "${SOURCE_URL}" --dir="${ROOT_DIR}"
# Keep only protein sequences.
grep --after-context=1 --no-group-separator '>.* mol:protein' "${ROOT_DIR}/pdb_seqres.txt" > "${ROOT_DIR}/pdb_seqres_filtered.txt"
mv "${ROOT_DIR}/pdb_seqres_filtered.txt" "${ROOT_DIR}/pdb_seqres.txt"
@@ -30,10 +30,10 @@ if ! command -v aria2c &> /dev/null ; then
fi
DOWNLOAD_DIR="$1"
ROOT_DIR="${DOWNLOAD_DIR}/uniclust30"
ROOT_DIR="${DOWNLOAD_DIR}/uniref30"
# Mirror of:
# http://wwwuser.gwdg.de/~compbiol/uniclust/2018_08/uniclust30_2018_08_hhsuite.tar.gz
SOURCE_URL="https://storage.googleapis.com/alphafold-databases/casp14_versions/uniclust30_2018_08_hhsuite.tar.gz"
# https://wwwuser.gwdg.de/~compbiol/uniclust/2021_03/UniRef30_2021_03.tar.gz
SOURCE_URL="https://storage.googleapis.com/alphafold-databases/v2.3/UniRef30_2021_03.tar.gz"
BASENAME=$(basename "${SOURCE_URL}")
mkdir --parents "${ROOT_DIR}"
......
@@ -18,7 +18,7 @@ from setuptools import setup
setup(
name='alphafold',
version='2.2.4',
version='2.3.0',
description='An implementation of the inference pipeline of AlphaFold v2.0. '
'This is a completely new model that was entered as AlphaFold2 in CASP14 '
'and published in Nature.',
......