Commit 0bab1bf8 authored by Saran Tunyasuvunakool

Add a Colab notebook, add reduced BFD, and various other fixes and improvements.

PiperOrigin-RevId: 386228948
parent d26287ea
......@@ -9,7 +9,15 @@ of this document.
Any publication that discloses findings arising from using this source code or
the model parameters should [cite](#citing-this-work) the
[AlphaFold paper](https://doi.org/10.1038/s41586-021-03819-2).
[AlphaFold paper](https://doi.org/10.1038/s41586-021-03819-2). Please also refer
to the
[Supplementary Information](https://static-content.springer.com/esm/art%3A10.1038%2Fs41586-021-03819-2/MediaObjects/41586_2021_3819_MOESM1_ESM.pdf)
for a detailed description of the method.
**You can use a slightly simplified version of AlphaFold with
[this Colab
notebook](https://colab.research.google.com/github/deepmind/alphafold/blob/main/notebooks/AlphaFold.ipynb)**
or community-supported versions (see below).
![CASP14 predictions](imgs/casp14_predictions.gif)
......@@ -39,7 +47,7 @@ The following steps are required in order to run AlphaFold:
### Genetic databases
This step requires `rsync` and `aria2c` to be installed on your machine.
This step requires `aria2c` to be installed on your machine.
AlphaFold needs multiple genetic (sequence) databases to run:
......@@ -51,21 +59,43 @@ AlphaFold needs multiple genetic (sequence) databases to run:
* [PDB](https://www.rcsb.org/) (structures in the mmCIF format).
We provide a script `scripts/download_all_data.sh` that can be used to download
and set up all of these databases. This should take 8–12 hours.
and set up all of these databases:
* Default:
:ledger: **Note: The total download size is around 428 GB and the total size
when unzipped is 2.2 TB. Please make sure you have a large enough hard drive
space, bandwidth and time to download.**
```bash
scripts/download_all_data.sh <DOWNLOAD_DIR>
```
will download the full databases.
* With `reduced_dbs`:
```bash
scripts/download_all_data.sh <DOWNLOAD_DIR> reduced_dbs
```
will download a reduced version of the databases to be used with the
`reduced_dbs` preset.
We don't provide exactly the versions used in CASP14 -- see the
[note on reproducibility](#note-on-reproducibility). Some of the databases are
mirrored for speed; see [mirrored databases](#mirrored-databases).
:ledger: **Note: The total download size for the full databases is around 415 GB
and the total size when unzipped is 2.2 TB. Please make sure you have enough
hard drive space, bandwidth, and time to download. We recommend using an SSD
for better genetic search performance.**
This script will also download the model parameter files. Once the script has
finished, you should have the following directory structure:
```
$DOWNLOAD_DIR/ # Total: ~ 2.2 TB (download: 428 GB)
bfd/ # ~ 1.8 TB (download: 271.6 GB)
$DOWNLOAD_DIR/ # Total: ~ 2.2 TB (download: 438 GB)
bfd/ # ~ 1.7 TB (download: 271.6 GB)
# 6 files.
mgnify/ # ~ 64 GB (download: 32.9 GB)
mgy_clusters.fa
mgy_clusters_2018_08.fa
params/ # ~ 3.5 GB (download: 3.5 GB)
# 5 CASP14 models,
# 5 pTM models,
......@@ -77,13 +107,18 @@ $DOWNLOAD_DIR/ # Total: ~ 2.2 TB (download: 428 GB)
mmcif_files/
# About 180,000 .cif files.
obsolete.dat
uniclust30/ # ~ 87 GB (download: 24.9 GB)
small_bfd/ # ~ 17 GB (download: 9.6 GB)
bfd-first_non_consensus_sequences.fasta
uniclust30/ # ~ 86 GB (download: 24.9 GB)
uniclust30_2018_08/
# 13 files.
uniref90/ # ~ 59 GB (download: 29.7 GB)
uniref90/ # ~ 58 GB (download: 29.7 GB)
uniref90.fasta
```
`bfd/` is only downloaded if you download the full databases, and `small_bfd/`
is only downloaded if you download the reduced databases.
### Model parameters
While the AlphaFold code is licensed under the Apache 2.0 License, the AlphaFold
......@@ -149,16 +184,20 @@ with 12 vCPUs, 85 GB of RAM, a 100 GB boot disk, the databases on an additional
[GPU enumeration](https://docs.nvidia.com/datacenter/cloud-native/container-toolkit/user-guide.html#gpu-enumeration)
for more details.
1. You can control AlphaFold speed / quality tradeoff by adding either
`--preset=full_dbs` or `--preset=casp14` to the run command. We provide the
following presets:
1. You can control AlphaFold speed / quality tradeoff by adding
`--preset=reduced_dbs`, `--preset=full_dbs` or `--preset=casp14` to the run
command. We provide the following presets:
* **casp14**: This preset uses the same settings as were used in CASP14.
It runs with all genetic databases and with 8 ensemblings.
* **reduced_dbs**: This preset is optimized for speed and lower hardware
requirements. It runs with a reduced version of the BFD database and
with no ensembling. It requires 8 CPU cores (vCPUs), 8 GB of RAM, and
600 GB of disk space.
* **full_dbs**: The model in this preset is 8 times faster than the
`casp14` preset with a very minor quality drop (-0.1 average GDT drop on
CASP14 domains). It runs with all genetic databases and with no
ensembling.
* **casp14**: This preset uses the same settings as were used in CASP14.
It runs with all genetic databases and with 8 ensemblings.
Running the command above with the `casp14` preset would look like this:
......@@ -174,7 +213,7 @@ structures, raw model outputs, prediction metadata, and section timings. The
`output_dir` directory will have the following structure:
```
output_dir/
<target_name>/
features.pkl
ranked_{0,1,2,3,4}.pdb
ranking_debug.json
......@@ -190,20 +229,20 @@ output_dir/
The contents of each output file are as follows:
* `features.pkl` – A `pickle` file containing the input feature Numpy arrays
* `features.pkl` – A `pickle` file containing the input feature NumPy arrays
used by the models to produce the structures.
* `unrelaxed_model_*.pdb` – A PDB format text file containing the predicted
structure, exactly as outputted by the model.
* `relaxed_model_*.pdb` – A PDB format text file containing the predicted
structure, after performing an Amber relaxation procedure on the unrelaxed
structure prediction, see Jumper et al. 2021, Suppl. Methods 1.8.6 for
details.
structure prediction (see Jumper et al. 2021, Suppl. Methods 1.8.6 for
details).
* `ranked_*.pdb` – A PDB format text file containing the relaxed predicted
structures, after reordering by model confidence. Here `ranked_0.pdb` should
contain the prediction with the highest confidence, and `ranked_4.pdb` the
prediction with the lowest confidence. To rank model confidence, we use
predicted LDDT (pLDDT), see Jumper et al. 2021, Suppl. Methods 1.9.6 for
details.
predicted LDDT (pLDDT) scores (see Jumper et al. 2021, Suppl. Methods 1.9.6
for details).
* `ranking_debug.json` – A JSON format text file containing the pLDDT values
used to perform the model ranking, and a mapping back to the original model
names.
......@@ -212,10 +251,27 @@ The contents of each output file are as follows:
* `msas/` - A directory containing the files describing the various genetic
tool hits that were used to construct the input MSA.
* `result_model_*.pkl` – A `pickle` file containing a nested dictionary of the
various Numpy arrays directly produced by the model. In addition to the
output of the structure module, this includes auxiliary outputs such as
distograms and pLDDT scores. If using the pTM models then the pTM logits
will also be contained in this file.
various NumPy arrays directly produced by the model. In addition to the
output of the structure module, this includes auxiliary outputs such as:
* Distograms (`distogram/logits` contains a NumPy array of shape [N_res,
N_res, N_bins] and `distogram/bin_edges` contains the definition of the
bins).
* Per-residue pLDDT scores (`plddt` contains a NumPy array of shape
[N_res] with the range of possible values from `0` to `100`, where `100`
means most confident). This can be used to identify sequence regions
predicted with high confidence, or, averaged across residues, as an overall
per-target confidence score.
* Present only if using pTM models: predicted TM-score (`ptm` field
contains a scalar). As a predictor of a global superposition metric,
this score is designed to also assess whether the model is confident in
the overall domain packing.
* Present only if using pTM models: predicted pairwise aligned errors
(`predicted_aligned_error` contains a NumPy array of shape [N_res,
N_res] with the range of possible values from `0` to
`max_predicted_aligned_error`, where `0` means most confident). This can be
used to visualise domain packing confidence within the structure (see the
loading sketch below).
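
As an illustration, the sketch below loads one of these pickle files and reads
back the arrays listed above. The target name and model number in the path are
hypothetical, and the last two reads assume a pTM model was used.

```python
import pickle

import numpy as np

# Hypothetical output path; substitute your own target name and model number.
with open('output_dir/my_target/result_model_1.pkl', 'rb') as f:
    result = pickle.load(f)

plddt = result['plddt']  # [N_res], values 0-100, higher is more confident
print('Mean pLDDT:', np.mean(plddt))

distogram_logits = result['distogram']['logits']   # [N_res, N_res, N_bins]
distogram_bins = result['distogram']['bin_edges']  # definition of the bins

# Present only when a pTM model was used:
if 'ptm' in result:
    print('Predicted TM-score:', float(result['ptm']))
if 'predicted_aligned_error' in result:
    pae = result['predicted_aligned_error']  # [N_res, N_res], 0 is most confident
```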
This code has been tested to match mean top-1 accuracy on a CASP14 test set with
pLDDT ranking over 5 model predictions (some CASP targets were run with earlier
......@@ -284,6 +340,17 @@ If you use the code or data in this package, please cite:
}
```
## Community contributions
Colab notebooks provided by the community (please note that these notebooks may
vary from our full AlphaFold system and we did not validate their accuracy):
* The [ColabFold AlphaFold2 notebook](https://colab.research.google.com/github/sokrypton/ColabFold/blob/main/AlphaFold2.ipynb)
by Martin Steinegger, Sergey Ovchinnikov and Milot Mirdita, which uses an
API hosted at the Södinglab based on the MMseqs2 server [(Mirdita et al.
2019, Bioinformatics)](https://academic.oup.com/bioinformatics/article/35/16/2856/5280135)
for the multiple sequence alignment creation.
## Acknowledgements
AlphaFold communicates with and/or references the following separate libraries
......@@ -292,6 +359,7 @@ and packages:
* [Abseil](https://github.com/abseil/abseil-py)
* [Biopython](https://biopython.org)
* [Chex](https://github.com/deepmind/chex)
* [Colab](https://research.google.com/colaboratory/)
* [Docker](https://www.docker.com)
* [HH Suite](https://github.com/soedinglab/hh-suite)
* [HMMER Suite](http://eddylab.org/software/hmmer)
......@@ -299,18 +367,20 @@ and packages:
* [Immutabledict](https://github.com/corenting/immutabledict)
* [JAX](https://github.com/google/jax/)
* [Kalign](https://msa.sbc.su.se/cgi-bin/msa.cgi)
* [matplotlib](https://matplotlib.org/)
* [ML Collections](https://github.com/google/ml_collections)
* [NumPy](https://numpy.org)
* [OpenMM](https://github.com/openmm/openmm)
* [OpenStructure](https://openstructure.org)
* [pymol3d](https://github.com/avirshup/py3dmol)
* [SciPy](https://scipy.org)
* [Sonnet](https://github.com/deepmind/sonnet)
* [TensorFlow](https://github.com/tensorflow/tensorflow)
* [Tree](https://github.com/deepmind/tree)
* [tqdm](https://github.com/tqdm/tqdm)
We thank all their contributors and maintainers!
## License and Disclaimer
This is not an officially supported Google product.
......@@ -349,3 +419,10 @@ before use.
The following databases have been mirrored by DeepMind, and are available with reference to the following:
* [BFD](https://bfd.mmseqs.com/) (unmodified), by Steinegger M. and Söding J., available under a [Creative Commons Attribution-ShareAlike 4.0 International License](http://creativecommons.org/licenses/by-sa/4.0/).
* [BFD](https://bfd.mmseqs.com/) (modified), by Steinegger M. and Söding J., modified by DeepMind, available under a [Creative Commons Attribution-ShareAlike 4.0 International License](http://creativecommons.org/licenses/by-sa/4.0/). See the Methods section of the [AlphaFold proteome paper](https://www.nature.com/articles/s41586-021-03828-1) for details.
* [Uniclust30: v2018_08](http://wwwuser.gwdg.de/~compbiol/uniclust/2018_08/) (unmodified), by Mirdita M. et al., available under a [Creative Commons Attribution-ShareAlike 4.0 International License](http://creativecommons.org/licenses/by-sa/4.0/).
* [MGnify: v2018_12](http://ftp.ebi.ac.uk/pub/databases/metagenomics/peptide_database/current_release/README.txt) (unmodified), by Mitchell AL et al., available free of all copyright restrictions and made fully and freely available for both non-commercial and commercial use under [CC0 1.0 Universal (CC0 1.0) Public Domain Dedication](https://creativecommons.org/publicdomain/zero/1.0/).
......@@ -67,7 +67,7 @@ def from_pdb_string(pdb_str: str, chain_id: Optional[str] = None) -> Protein:
A new `Protein` parsed from the pdb contents.
"""
pdb_fh = io.StringIO(pdb_str)
parser = PDBParser()
parser = PDBParser(QUIET=True)
structure = parser.get_structure('none', pdb_fh)
models = list(structure.get_models())
if len(models) != 1:
......@@ -207,22 +207,25 @@ def ideal_atom_mask(prot: Protein) -> np.ndarray:
return residue_constants.STANDARD_ATOM_MASK[prot.aatype]
def from_prediction(features: FeatureDict, result: ModelOutput) -> Protein:
def from_prediction(features: FeatureDict, result: ModelOutput,
b_factors: Optional[np.ndarray] = None) -> Protein:
"""Assembles a protein from a prediction.
Args:
features: Dictionary holding model inputs.
result: Dictionary holding model outputs.
b_factors: (Optional) B-factors to use for the protein.
Returns:
A protein instance.
"""
fold_output = result['structure_module']
dist_per_residue = np.zeros_like(fold_output['final_atom_mask'])
if b_factors is None:
b_factors = np.zeros_like(fold_output['final_atom_mask'])
return Protein(
aatype=features['aatype'][0],
atom_positions=fold_output['final_atom_positions'],
atom_mask=fold_output['final_atom_mask'],
residue_index=features['residue_index'][0] + 1,
b_factors=dist_per_residue)
b_factors=b_factors)
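# Editorial sketch (not part of this file): one way a caller might use the new
# `b_factors` argument is to broadcast per-residue pLDDT to a per-atom array of
# the same shape as `final_atom_mask`, so that PDB viewers colouring by
# B-factor show model confidence. The helper name below is hypothetical.
import numpy as np
from alphafold.common import protein, residue_constants

def prediction_to_pdb(features, result):
  plddt = result['plddt']  # [num_res] per-residue confidence
  b_factors = np.repeat(
      plddt[:, None], residue_constants.atom_type_num, axis=1)
  prot = protein.from_prediction(features, result, b_factors=b_factors)
  return protein.to_pdb(prot)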
......@@ -16,7 +16,7 @@
import collections
import re
import string
from typing import Iterable, List, Optional, Sequence, Tuple
from typing import Iterable, List, Optional, Sequence, Tuple, Dict
import dataclasses
......@@ -24,23 +24,14 @@ DeletionMatrix = Sequence[Sequence[int]]
@dataclasses.dataclass(frozen=True)
class HhrHit:
"""Class representing a hit in an hhr file."""
class TemplateHit:
"""Class representing a template hit."""
index: int
name: str
prob_true: float
e_value: float
score: float
aligned_cols: int
identity: float
similarity: float
sum_probs: float
neff: float
query: str
hit_sequence: str
hit_dssp: str
column_score_code: str
confidence_scores: str
indices_query: List[int]
indices_hit: List[int]
......@@ -75,7 +66,8 @@ def parse_fasta(fasta_string: str) -> Tuple[Sequence[str], Sequence[str]]:
def parse_stockholm(
stockholm_string: str) -> Tuple[Sequence[str], DeletionMatrix]:
stockholm_string: str
) -> Tuple[Sequence[str], DeletionMatrix, Sequence[str]]:
"""Parses sequences and deletion matrix from stockholm format alignment.
Args:
......@@ -89,6 +81,8 @@ def parse_stockholm(
* The deletion matrix for the alignment as a list of lists. The element
at `deletion_matrix[i][j]` is the number of residues deleted from
the aligned sequence i at residue position j.
* The names of the targets matched, including the jackhmmer subsequence
suffix.
"""
name_to_sequence = collections.OrderedDict()
for line in stockholm_string.splitlines():
......@@ -128,7 +122,7 @@ def parse_stockholm(
deletion_count = 0
deletion_matrix.append(deletion_vec)
return msa, deletion_matrix
return msa, deletion_matrix, list(name_to_sequence.keys())
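# Editorial sketch (not part of this file): the third element of the return
# value lists the matched target names, including any jackhmmer subsequence
# suffix. The toy alignment below is illustrative only.
toy_sto = (
    '# STOCKHOLM 1.0\n'
    'query     MKV-LI\n'
    'hitA/1-6  MKVQLI\n'
    '//\n')
msa, deletion_matrix, names = parse_stockholm(toy_sto)
# names would be ['query', 'hitA/1-6']; msa[0] is the (ungapped) query row.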
def parse_a3m(a3m_string: str) -> Tuple[Sequence[str], DeletionMatrix]:
......@@ -242,7 +236,7 @@ def _update_hhr_residue_indices_list(
counter += 1
def _parse_hhr_hit(detailed_lines: Sequence[str]) -> HhrHit:
def _parse_hhr_hit(detailed_lines: Sequence[str]) -> TemplateHit:
"""Parses the detailed HMM HMM comparison section for a single Hit.
This works on .hhr files generated from both HHBlits and HHSearch.
......@@ -271,7 +265,7 @@ def _parse_hhr_hit(detailed_lines: Sequence[str]) -> HhrHit:
raise RuntimeError(
'Could not parse section: %s. Expected this: \n%s to contain summary.' %
(detailed_lines, detailed_lines[2]))
(prob_true, e_value, score, aligned_cols, identity, similarity, sum_probs,
(prob_true, e_value, _, aligned_cols, _, _, sum_probs,
neff) = [float(x) for x in match.groups()]
# The next section reads the detailed comparisons. These are in a 'human
......@@ -280,9 +274,6 @@ def _parse_hhr_hit(detailed_lines: Sequence[str]) -> HhrHit:
# that with a regexp in order to deduce the fixed length used for that block.
query = ''
hit_sequence = ''
hit_dssp = ''
column_score_code = ''
confidence_scores = ''
indices_query = []
indices_hit = []
length_block = None
......@@ -312,17 +303,10 @@ def _parse_hhr_hit(detailed_lines: Sequence[str]) -> HhrHit:
_update_hhr_residue_indices_list(delta_query, start, indices_query)
elif line.startswith('T '):
# Parse the hit dssp line.
if line.startswith('T ss_dssp'):
# T ss_dssp hit_dssp
patt = r'T ss_dssp[\t ]*([A-Z-]*)'
groups = _get_hhr_line_regex_groups(patt, line)
assert len(groups[0]) == length_block
hit_dssp += groups[0]
# Parse the hit sequence.
elif (not line.startswith('T ss_pred') and
not line.startswith('T Consensus')):
if (not line.startswith('T ss_dssp') and
not line.startswith('T ss_pred') and
not line.startswith('T Consensus')):
# Thus the first 17 characters must be 'T <hit_name> ', and we can
# parse everything after that.
# start sequence end total_sequence_length
......@@ -336,38 +320,19 @@ def _parse_hhr_hit(detailed_lines: Sequence[str]) -> HhrHit:
hit_sequence += delta_hit_sequence
_update_hhr_residue_indices_list(delta_hit_sequence, start, indices_hit)
# Parse the column score line.
elif line.startswith(' ' * 22):
assert length_block
column_score_code += line[22:length_block + 22]
# Update confidence score.
elif line.startswith('Confidence'):
assert length_block
confidence_scores += line[22:length_block + 22]
return HhrHit(
return TemplateHit(
index=number_of_hit,
name=name_hit,
prob_true=prob_true,
e_value=e_value,
score=score,
aligned_cols=int(aligned_cols),
identity=identity,
similarity=similarity,
sum_probs=sum_probs,
neff=neff,
query=query,
hit_sequence=hit_sequence,
hit_dssp=hit_dssp,
column_score_code=column_score_code,
confidence_scores=confidence_scores,
indices_query=indices_query,
indices_hit=indices_hit,
)
def parse_hhr(hhr_string: str) -> Sequence[HhrHit]:
def parse_hhr(hhr_string: str) -> Sequence[TemplateHit]:
"""Parses the content of an entire HHR file."""
lines = hhr_string.splitlines()
......@@ -383,3 +348,18 @@ def parse_hhr(hhr_string: str) -> Sequence[HhrHit]:
for i in range(len(block_starts) - 1):
hits.append(_parse_hhr_hit(lines[block_starts[i]:block_starts[i + 1]]))
return hits
def parse_e_values_from_tblout(tblout: str) -> Dict[str, float]:
"""Parse target to e-value mapping parsed from Jackhmmer tblout string."""
e_values = {'query': 0}
lines = [line for line in tblout.splitlines() if line and line[0] != '#']
# As per http://eddylab.org/software/hmmer/Userguide.pdf fields are
# space-delimited. Relevant fields are (1) target name: and
# (5) E-value (full sequence) (numbering from 1).
for line in lines:
fields = line.split()
e_value = fields[4]
target_name = fields[0]
e_values[target_name] = float(e_value)
return e_values
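# Editorial sketch (synthetic input, not real hmmer output): only the fields
# the parser reads need to be meaningful here -- the target name in column 1
# and the full-sequence E-value in column 5.
toy_tblout = (
    '# --- header comment lines are skipped ---\n'
    'hitA - query - 1.2e-30 100.0 0.1\n'
    'hitB - query - 3.4e-05  20.0 0.0\n')
e_values = parse_e_values_from_tblout(toy_tblout)
# e_values == {'query': 0, 'hitA': 1.2e-30, 'hitB': 3.4e-05}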
......@@ -15,7 +15,7 @@
"""Functions for building the input features for the AlphaFold model."""
import os
from typing import Mapping, Sequence
from typing import Mapping, Optional, Sequence
import numpy as np
......@@ -88,19 +88,27 @@ class DataPipeline:
hhsearch_binary_path: str,
uniref90_database_path: str,
mgnify_database_path: str,
bfd_database_path: str,
uniclust30_database_path: str,
bfd_database_path: Optional[str],
uniclust30_database_path: Optional[str],
small_bfd_database_path: Optional[str],
pdb70_database_path: str,
template_featurizer: templates.TemplateHitFeaturizer,
use_small_bfd: bool,
mgnify_max_hits: int = 501,
uniref_max_hits: int = 10000):
"""Constructs a feature dict for a given FASTA file."""
self._use_small_bfd = use_small_bfd
self.jackhmmer_uniref90_runner = jackhmmer.Jackhmmer(
binary_path=jackhmmer_binary_path,
database_path=uniref90_database_path)
self.hhblits_bfd_uniclust_runner = hhblits.HHBlits(
binary_path=hhblits_binary_path,
databases=[bfd_database_path, uniclust30_database_path])
if use_small_bfd:
self.jackhmmer_small_bfd_runner = jackhmmer.Jackhmmer(
binary_path=jackhmmer_binary_path,
database_path=small_bfd_database_path)
else:
self.hhblits_bfd_uniclust_runner = hhblits.HHBlits(
binary_path=hhblits_binary_path,
databases=[bfd_database_path, uniclust30_database_path])
self.jackhmmer_mgnify_runner = jackhmmer.Jackhmmer(
binary_path=jackhmmer_binary_path,
database_path=mgnify_database_path)
......@@ -124,9 +132,9 @@ class DataPipeline:
num_res = len(input_sequence)
jackhmmer_uniref90_result = self.jackhmmer_uniref90_runner.query(
input_fasta_path)
input_fasta_path)[0]
jackhmmer_mgnify_result = self.jackhmmer_mgnify_runner.query(
input_fasta_path)
input_fasta_path)[0]
uniref90_msa_as_a3m = parsers.convert_stockholm_to_a3m(
jackhmmer_uniref90_result['sto'], max_sequences=self.uniref_max_hits)
......@@ -140,29 +148,40 @@ class DataPipeline:
with open(mgnify_out_path, 'w') as f:
f.write(jackhmmer_mgnify_result['sto'])
uniref90_msa, uniref90_deletion_matrix = parsers.parse_stockholm(
uniref90_msa, uniref90_deletion_matrix, _ = parsers.parse_stockholm(
jackhmmer_uniref90_result['sto'])
mgnify_msa, mgnify_deletion_matrix = parsers.parse_stockholm(
mgnify_msa, mgnify_deletion_matrix, _ = parsers.parse_stockholm(
jackhmmer_mgnify_result['sto'])
hhsearch_hits = parsers.parse_hhr(hhsearch_result)
mgnify_msa = mgnify_msa[:self.mgnify_max_hits]
mgnify_deletion_matrix = mgnify_deletion_matrix[:self.mgnify_max_hits]
hhblits_bfd_uniclust_result = self.hhblits_bfd_uniclust_runner.query(
input_fasta_path)
if self._use_small_bfd:
jackhmmer_small_bfd_result = self.jackhmmer_small_bfd_runner.query(
input_fasta_path)[0]
bfd_out_path = os.path.join(msa_output_dir, 'bfd_uniclust_hits.a3m')
with open(bfd_out_path, 'w') as f:
f.write(hhblits_bfd_uniclust_result['a3m'])
bfd_out_path = os.path.join(msa_output_dir, 'small_bfd_hits.a3m')
with open(bfd_out_path, 'w') as f:
f.write(jackhmmer_small_bfd_result['sto'])
bfd_msa, bfd_deletion_matrix = parsers.parse_a3m(
hhblits_bfd_uniclust_result['a3m'])
bfd_msa, bfd_deletion_matrix, _ = parsers.parse_stockholm(
jackhmmer_small_bfd_result['sto'])
else:
hhblits_bfd_uniclust_result = self.hhblits_bfd_uniclust_runner.query(
input_fasta_path)
bfd_out_path = os.path.join(msa_output_dir, 'bfd_uniclust_hits.a3m')
with open(bfd_out_path, 'w') as f:
f.write(hhblits_bfd_uniclust_result['a3m'])
bfd_msa, bfd_deletion_matrix = parsers.parse_a3m(
hhblits_bfd_uniclust_result['a3m'])
templates_result = self.template_featurizer.get_templates(
query_sequence=input_sequence,
query_pdb_code=None,
query_release_date=None,
hhr_hits=hhsearch_hits)
hits=hhsearch_hits)
sequence_features = make_sequence_features(
sequence=input_sequence,
......
......@@ -93,19 +93,12 @@ TEMPLATE_FEATURES = {
'template_all_atom_masks': np.float32,
'template_all_atom_positions': np.float32,
'template_domain_names': np.object,
'template_e_value': np.float32,
'template_neff': np.float32,
'template_prob_true': np.float32,
'template_release_date': np.object,
'template_score': np.float32,
'template_similarity': np.float32,
'template_sequence': np.object,
'template_sum_probs': np.float32,
'template_confidence_scores': np.int64
}
def _get_pdb_id_and_chain(hit: parsers.HhrHit) -> Tuple[str, str]:
def _get_pdb_id_and_chain(hit: parsers.TemplateHit) -> Tuple[str, str]:
"""Returns PDB id and chain id for an HHSearch Hit."""
# PDB ID: 4 letters. Chain ID: 1+ alphanumeric letters or "." if unknown.
id_match = re.match(r'[a-zA-Z\d]{4}_[a-zA-Z0-9.]+', hit.name)
......@@ -175,7 +168,7 @@ def _parse_release_dates(path: str) -> Mapping[str, datetime.datetime]:
def _assess_hhsearch_hit(
hit: parsers.HhrHit,
hit: parsers.TemplateHit,
hit_pdb_code: str,
query_sequence: str,
query_pdb_code: Optional[str],
......@@ -487,7 +480,6 @@ def _extract_template_features(
template_sequence: str,
query_sequence: str,
template_chain_id: str,
confidence_scores: str,
kalign_binary_path: str) -> Tuple[Dict[str, Any], Optional[str]]:
"""Parses atom positions in the target structure and aligns with the query.
......@@ -495,21 +487,6 @@ def _extract_template_features(
with their corresponding residue in the query sequence, according to the
alignment mapping provided.
Note that we only extract at most 500 templates because of HHSearch settings.
We set missing/invalid confidence scores to the default value of -1.
Note: We now have 4 types of confidence scores:
1. Valid scores
2. Invalid scores of residues not in both the query sequence and template
sequence
3. Missing scores because we don't have the secondary structure, and HHAlign
doesn't produce the posterior probabilities in this case.
4. Missing scores because of a different template sequence in PDB70,
invalidating the previously computed confidence scores. (Though in theory
HHAlign can be run on these to recompute the correct confidence scores).
We handle invalid and missing scores by setting them to -1, but consider
adding masks for the different types.
Args:
mmcif_object: mmcif_parsing.MmcifObject representing the template.
pdb_id: PDB code for the template.
......@@ -521,11 +498,6 @@ def _extract_template_features(
protein.
template_chain_id: String ID describing which chain in the structure proto
should be used.
confidence_scores: String containing per-residue confidence scores, where
each character represents the *TRUNCATED* posterior probability that the
corresponding template residue is correctly aligned with the query
residue, given the database match is correct (0 corresponds approximately
to 0-10%, 9 to 90-100%).
kalign_binary_path: The path to a kalign executable used for template
realignment.
......@@ -577,8 +549,6 @@ def _extract_template_features(
template_sequence = seqres
# No mapping offset, the query is aligned to the actual sequence.
mapping_offset = 0
# Confidence scores were based on the previous sequence, so they are invalid
confidence_scores = None
try:
# Essentially set to infinity - we don't want to reject templates unless
......@@ -594,7 +564,6 @@ def _extract_template_features(
all_atom_masks = np.split(all_atom_mask, all_atom_mask.shape[0])
output_templates_sequence = []
output_confidence_scores = []
templates_all_atom_positions = []
templates_all_atom_masks = []
......@@ -604,15 +573,12 @@ def _extract_template_features(
np.zeros((residue_constants.atom_type_num, 3)))
templates_all_atom_masks.append(np.zeros(residue_constants.atom_type_num))
output_templates_sequence.append('-')
output_confidence_scores.append(-1)
for k, v in mapping.items():
template_index = v + mapping_offset
templates_all_atom_positions[k] = all_atom_positions[template_index][0]
templates_all_atom_masks[k] = all_atom_masks[template_index][0]
output_templates_sequence[k] = template_sequence[v]
if confidence_scores and confidence_scores[v] != ' ':
output_confidence_scores[k] = int(confidence_scores[v])
# Alanine (AA with the lowest number of atoms) has 5 atoms (C, CA, CB, N, O).
if np.sum(templates_all_atom_masks) < 5:
......@@ -627,13 +593,13 @@ def _extract_template_features(
output_templates_sequence, residue_constants.HHBLITS_AA_TO_ID)
return (
{'template_all_atom_positions': np.array(templates_all_atom_positions),
'template_all_atom_masks': np.array(templates_all_atom_masks),
'template_sequence': output_templates_sequence.encode(),
'template_aatype': np.array(templates_aatype),
'template_confidence_scores': np.array(output_confidence_scores),
'template_domain_names': f'{pdb_id.lower()}_{chain_id}'.encode(),
'template_release_date': mmcif_object.header['release_date'].encode()},
{
'template_all_atom_positions': np.array(templates_all_atom_positions),
'template_all_atom_masks': np.array(templates_all_atom_masks),
'template_sequence': output_templates_sequence.encode(),
'template_aatype': np.array(templates_aatype),
'template_domain_names': f'{pdb_id.lower()}_{chain_id}'.encode(),
},
warning)
......@@ -704,7 +670,7 @@ class SingleHitResult:
def _process_single_hit(
query_sequence: str,
query_pdb_code: Optional[str],
hit: parsers.HhrHit,
hit: parsers.TemplateHit,
mmcif_dir: str,
max_template_date: datetime.datetime,
release_dates: Mapping[str, datetime.datetime],
......@@ -745,9 +711,6 @@ def _process_single_hit(
# The mapping is from the query to the actual hit sequence, so we need to
# remove gaps.
template_sequence = hit.hit_sequence.replace('-', '')
confidence_scores = ''.join(
[cs for t, cs in zip(hit.hit_sequence, hit.confidence_scores)
if t != '-'])
cif_path = os.path.join(mmcif_dir, hit_pdb_code + '.cif')
logging.info('Reading PDB entry from %s. Query: %s, template: %s',
......@@ -779,14 +742,8 @@ def _process_single_hit(
template_sequence=template_sequence,
query_sequence=query_sequence,
template_chain_id=hit_chain_id,
confidence_scores=confidence_scores,
kalign_binary_path=kalign_binary_path)
features['template_e_value'] = [hit.e_value]
features['template_sum_probs'] = [hit.sum_probs]
features['template_prob_true'] = [hit.prob_true]
features['template_score'] = [hit.score]
features['template_neff'] = [hit.neff]
features['template_similarity'] = [hit.similarity]
# It is possible there were some errors when parsing the other chains in the
# mmCIF file, but the template features for the chain we want were still
......@@ -887,7 +844,7 @@ class TemplateHitFeaturizer:
query_sequence: str,
query_pdb_code: Optional[str],
query_release_date: Optional[datetime.datetime],
hhr_hits: Sequence[parsers.HhrHit]) -> TemplateSearchResult:
hits: Sequence[parsers.TemplateHit]) -> TemplateSearchResult:
"""Computes the templates for given query sequence (more details above)."""
logging.info('Searching for template for: %s', query_pdb_code)
......@@ -909,8 +866,8 @@ class TemplateHitFeaturizer:
errors = []
warnings = []
for hit in sorted(hhr_hits, key=lambda x: x.sum_probs, reverse=True):
# We got all the templates we wanted, stop processing HHSearch hits.
for hit in sorted(hits, key=lambda x: x.sum_probs, reverse=True):
# We got all the templates we wanted, stop processing hits.
if num_hits >= self._max_hits:
break
......
# Copyright 2021 DeepMind Technologies Limited
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
"""Python wrappers for third party tools."""
# Copyright 2021 DeepMind Technologies Limited
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
"""A Python wrapper for hmmbuild - construct HMM profiles from MSA."""
import os
import re
import subprocess
from absl import logging
# Internal import (7716).
from alphafold.data.tools import utils
class Hmmbuild(object):
"""Python wrapper of the hmmbuild binary."""
def __init__(self,
*,
binary_path: str,
singlemx: bool = False):
"""Initializes the Python hmmbuild wrapper.
Args:
binary_path: The path to the hmmbuild executable.
singlemx: Whether to use --singlemx flag. If True, it forces HMMBuild to
just use a common substitution score matrix.
Raises:
RuntimeError: If hmmbuild binary not found within the path.
"""
self.binary_path = binary_path
self.singlemx = singlemx
def build_profile_from_sto(self, sto: str, model_construction='fast') -> str:
"""Builds a HHM for the aligned sequences given as an A3M string.
Args:
sto: A string with the aligned sequences in the Stockholm format.
model_construction: Whether to use reference annotation in the msa to
determine consensus columns ('hand') or default ('fast').
Returns:
A string with the profile in the HMM format.
Raises:
RuntimeError: If hmmbuild fails.
"""
return self._build_profile(sto, model_construction=model_construction)
def build_profile_from_a3m(self, a3m: str) -> str:
"""Builds a HHM for the aligned sequences given as an A3M string.
Args:
a3m: A string with the aligned sequences in the A3M format.
Returns:
A string with the profile in the HMM format.
Raises:
RuntimeError: If hmmbuild fails.
"""
lines = []
for line in a3m.splitlines():
if not line.startswith('>'):
line = re.sub('[a-z]+', '', line) # Remove inserted residues.
lines.append(line + '\n')
msa = ''.join(lines)
return self._build_profile(msa, model_construction='fast')
def _build_profile(self, msa: str, model_construction: str = 'fast') -> str:
"""Builds a HMM for the aligned sequences given as an MSA string.
Args:
msa: A string with the aligned sequences, in A3M or STO format.
model_construction: Whether to use reference annotation in the msa to
determine consensus columns ('hand') or default ('fast').
Returns:
A string with the profile in the HMM format.
Raises:
RuntimeError: If hmmbuild fails.
ValueError: If unspecified arguments are provided.
"""
if model_construction not in {'hand', 'fast'}:
raise ValueError(f'Invalid model_construction {model_construction} - only '
'hand and fast supported.')
with utils.tmpdir_manager(base_dir='/tmp') as query_tmp_dir:
input_query = os.path.join(query_tmp_dir, 'query.msa')
output_hmm_path = os.path.join(query_tmp_dir, 'output.hmm')
with open(input_query, 'w') as f:
f.write(msa)
cmd = [self.binary_path]
# If adding flags, we have to do so before the output and input:
if model_construction == 'hand':
cmd.append(f'--{model_construction}')
if self.singlemx:
cmd.append('--singlemx')
cmd.extend([
'--amino',
output_hmm_path,
input_query,
])
logging.info('Launching subprocess %s', cmd)
process = subprocess.Popen(cmd, stdout=subprocess.PIPE,
stderr=subprocess.PIPE)
with utils.timing('hmmbuild query'):
stdout, stderr = process.communicate()
retcode = process.wait()
logging.info('hmmbuild stdout:\n%s\n\nstderr:\n%s\n',
stdout.decode('utf-8'), stderr.decode('utf-8'))
if retcode:
raise RuntimeError('hmmbuild failed\nstdout:\n%s\n\nstderr:\n%s\n'
% (stdout.decode('utf-8'), stderr.decode('utf-8')))
with open(output_hmm_path, encoding='utf-8') as f:
hmm = f.read()
return hmm
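# Example usage (editorial sketch, not part of the module; the binary path is
# an assumption about the local machine):
builder = Hmmbuild(binary_path='/usr/bin/hmmbuild')
toy_a3m = '>query\nMKTAYIAKQR\n>hit\nMKTAYIAKQR\n'
profile = builder.build_profile_from_a3m(toy_a3m)  # HMMER-format profile text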
# Copyright 2021 DeepMind Technologies Limited
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
"""A Python wrapper for hmmsearch - search profile against a sequence db."""
import os
import subprocess
from typing import Optional, Sequence
from absl import logging
# Internal import (7716).
from alphafold.data.tools import utils
class Hmmsearch(object):
"""Python wrapper of the hmmsearch binary."""
def __init__(self,
*,
binary_path: str,
database_path: str,
flags: Optional[Sequence[str]] = None):
"""Initializes the Python hmmsearch wrapper.
Args:
binary_path: The path to the hmmsearch executable.
database_path: The path to the hmmsearch database (FASTA format).
flags: List of flags to be used by hmmsearch.
Raises:
RuntimeError: If hmmsearch binary not found within the path.
"""
self.binary_path = binary_path
self.database_path = database_path
self.flags = flags
if not os.path.exists(self.database_path):
logging.error('Could not find hmmsearch database %s', database_path)
raise ValueError(f'Could not find hmmsearch database {database_path}')
def query(self, hmm: str) -> str:
"""Queries the database using hmmsearch using a given hmm."""
with utils.tmpdir_manager(base_dir='/tmp') as query_tmp_dir:
hmm_input_path = os.path.join(query_tmp_dir, 'query.hmm')
a3m_out_path = os.path.join(query_tmp_dir, 'output.a3m')
with open(hmm_input_path, 'w') as f:
f.write(hmm)
cmd = [
self.binary_path,
'--noali', # Don't include the alignment in stdout.
'--cpu', '8'
]
# If adding flags, we have to do so before the output and input:
if self.flags:
cmd.extend(self.flags)
cmd.extend([
'-A', a3m_out_path,
hmm_input_path,
self.database_path,
])
logging.info('Launching sub-process %s', cmd)
process = subprocess.Popen(
cmd, stdout=subprocess.PIPE, stderr=subprocess.PIPE)
with utils.timing(
f'hmmsearch ({os.path.basename(self.database_path)}) query'):
stdout, stderr = process.communicate()
retcode = process.wait()
if retcode:
raise RuntimeError(
'hmmsearch failed:\nstdout:\n%s\n\nstderr:\n%s\n' % (
stdout.decode('utf-8'), stderr.decode('utf-8')))
with open(a3m_out_path) as f:
a3m_out = f.read()
return a3m_out
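# Example usage (editorial sketch, not part of the module; the binary and
# database paths are assumptions): search a profile produced by Hmmbuild
# against a sequence database and retrieve the alignment of significant hits.
searcher = Hmmsearch(
    binary_path='/usr/bin/hmmsearch',
    database_path='/data/pdb_seqres.fasta')
with open('/tmp/query.hmm') as f:  # e.g. written out from Hmmbuild's output
  hmm_profile = f.read()
hits_alignment = searcher.query(hmm_profile)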
......@@ -14,9 +14,12 @@
"""Library to run Jackhmmer from Python."""
from concurrent import futures
import glob
import os
import subprocess
from typing import Any, Mapping, Optional
from typing import Any, Callable, Mapping, Optional, Sequence
from urllib import request
from absl import logging
......@@ -40,7 +43,9 @@ class Jackhmmer:
filter_f2: float = 0.00005,
filter_f3: float = 0.0000005,
incdom_e: Optional[float] = None,
dom_e: Optional[float] = None):
dom_e: Optional[float] = None,
num_streamed_chunks: Optional[int] = None,
streaming_callback: Optional[Callable[[int], None]] = None):
"""Initializes the Python Jackhmmer wrapper.
Args:
......@@ -57,11 +62,15 @@ class Jackhmmer:
incdom_e: Domain e-value criteria for inclusion of domains in MSA/next
round.
dom_e: Domain e-value criteria for inclusion in tblout.
num_streamed_chunks: Number of database chunks to stream over.
streaming_callback: Callback function run after each chunk iteration with
the iteration number as argument.
"""
self.binary_path = binary_path
self.database_path = database_path
self.num_streamed_chunks = num_streamed_chunks
if not os.path.exists(self.database_path):
if not os.path.exists(self.database_path) and num_streamed_chunks is None:
logging.error('Could not find Jackhmmer database %s', database_path)
raise ValueError(f'Could not find Jackhmmer database {database_path}')
......@@ -75,9 +84,11 @@ class Jackhmmer:
self.incdom_e = incdom_e
self.dom_e = dom_e
self.get_tblout = get_tblout
self.streaming_callback = streaming_callback
def query(self, input_fasta_path: str) -> Mapping[str, Any]:
"""Queries the database using Jackhmmer."""
def _query_chunk(self, input_fasta_path: str, database_path: str
) -> Mapping[str, Any]:
"""Queries the database chunk using Jackhmmer."""
with utils.tmpdir_manager(base_dir='/tmp') as query_tmp_dir:
sto_path = os.path.join(query_tmp_dir, 'output.sto')
......@@ -114,13 +125,13 @@ class Jackhmmer:
cmd_flags.extend(['--incdomE', str(self.incdom_e)])
cmd = [self.binary_path] + cmd_flags + [input_fasta_path,
self.database_path]
database_path]
logging.info('Launching subprocess "%s"', ' '.join(cmd))
process = subprocess.Popen(
cmd, stdout=subprocess.PIPE, stderr=subprocess.PIPE)
with utils.timing(
f'Jackhmmer ({os.path.basename(self.database_path)}) query'):
f'Jackhmmer ({os.path.basename(database_path)}) query'):
_, stderr = process.communicate()
retcode = process.wait()
......@@ -145,3 +156,43 @@ class Jackhmmer:
e_value=self.e_value)
return raw_output
def query(self, input_fasta_path: str) -> Sequence[Mapping[str, Any]]:
"""Queries the database using Jackhmmer."""
if self.num_streamed_chunks is None:
return [self._query_chunk(input_fasta_path, self.database_path)]
db_basename = os.path.basename(self.database_path)
db_remote_chunk = lambda db_idx: f'{self.database_path}.{db_idx}'
db_local_chunk = lambda db_idx: f'/tmp/ramdisk/{db_basename}.{db_idx}'
# Remove existing files to prevent OOM
for f in glob.glob(db_local_chunk('[0-9]*')):
try:
os.remove(f)
except OSError:
print(f'OSError while deleting {f}')
# Download the (i+1)-th chunk while Jackhmmer is running on the i-th chunk
with futures.ThreadPoolExecutor(max_workers=2) as executor:
chunked_output = []
for i in range(1, self.num_streamed_chunks + 1):
# Copy the chunk locally
if i == 1:
future = executor.submit(
request.urlretrieve, db_remote_chunk(i), db_local_chunk(i))
if i < self.num_streamed_chunks:
next_future = executor.submit(
request.urlretrieve, db_remote_chunk(i+1), db_local_chunk(i+1))
# Run Jackhmmer with the chunk
future.result()
chunked_output.append(
self._query_chunk(input_fasta_path, db_local_chunk(i)))
# Remove the local copy of the chunk
os.remove(db_local_chunk(i))
future = next_future
if self.streaming_callback:
self.streaming_callback(i)
return chunked_output
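# Editorial sketch of the streaming mode (not part of the module): when
# `num_streamed_chunks` is set, `database_path` is treated as the base name of
# a remotely hosted, pre-chunked database named <base>.1, <base>.2, ... The
# URL, chunk count, and callback below are assumptions for illustration only.
def report_progress(chunk_index):
  print(f'Finished chunk {chunk_index}')

runner = Jackhmmer(
    binary_path='/usr/bin/jackhmmer',
    database_path='https://example.com/uniref90_2021_03.fasta',
    num_streamed_chunks=59,
    streaming_callback=report_progress)
results = runner.query('/tmp/query.fasta')  # one result dict per chunk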
......@@ -492,7 +492,7 @@ class StructureModule(hk.Module):
is_training=is_training,
safe_key=safe_key)
representations['structure_module'] = output['act']
ret['representations'] = {'structure_module': output['act']}
ret['traj'] = output['affine'] * jnp.array([1.] * 4 +
[c.position_scale] * 3)
......@@ -514,7 +514,8 @@ class StructureModule(hk.Module):
if self.compute_loss:
return ret
else:
no_loss_features = ['final_atom_positions', 'final_atom_mask']
no_loss_features = ['final_atom_positions', 'final_atom_mask',
'representations']
no_loss_ret = {k: ret[k] for k in no_loss_features}
return no_loss_ret
......
......@@ -237,6 +237,10 @@ class AlphaFoldIteration(hk.Module):
continue
else:
ret[name] = module(representations, batch, is_training)
if 'representations' in ret[name]:
# Extra representations from the head. Used by the structure module
# to provide activations for the PredictedLDDTHead.
representations.update(ret[name].pop('representations'))
if compute_loss:
total_loss += loss(module, head_config, ret, name)
......
# Copyright 2021 DeepMind Technologies Limited
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
"""Alphafold model TensorFlow code."""
......@@ -146,13 +146,13 @@ def process_tensors_from_config(tensors, data_config):
num_ensemble *= data_config.common.num_recycle + 1
if isinstance(num_ensemble, tf.Tensor) or num_ensemble > 1:
dtype = tree.map_structure(lambda x: x.dtype,
tensors_0)
fn_output_signature = tree.map_structure(
tf.TensorSpec.from_tensor, tensors_0)
tensors = tf.map_fn(
lambda x: wrap_ensemble_fn(tensors, x),
tf.range(num_ensemble),
parallel_iterations=1,
dtype=dtype)
fn_output_signature=fn_output_signature)
else:
tensors = tree.map_structure(lambda x: x[None],
tensors_0)
......
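# Editorial sketch (not part of this file) of the pattern used above: in TF2,
# tf.map_fn describes structured outputs with `fn_output_signature` instead of
# the deprecated `dtype` argument.
import tensorflow as tf

out = tf.map_fn(
    lambda i: {'x': tf.cast(i, tf.float32), 'y': tf.stack([i, i])},
    tf.range(3),
    fn_output_signature={'x': tf.TensorSpec([], tf.float32),
                         'y': tf.TensorSpec([2], tf.int32)})
# out['x'].shape == (3,), out['y'].shape == (3, 2)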
......@@ -52,7 +52,7 @@ def _add_restraints(
stiffness: unit.Unit,
rset: str,
exclude_residues: Sequence[int]):
"""Adds a harmonic potential that restrains the end-to-end distance."""
"""Adds a harmonic potential that restrains the system to a structure."""
assert rset in ["non_hydrogen", "c_alpha"]
force = openmm.CustomExternalForce(
......
......@@ -54,7 +54,6 @@ class AmberMinimizeTest(absltest.TestCase):
max_attempts=1)
def test_iterative_relax(self):
# This test can occasionally fail because of nondeterminism in OpenMM.
prot = _load_test_protein(
'alphafold/relax/testdata/with_violations.pdb'
)
......
......@@ -48,7 +48,7 @@ def overwrite_b_factors(pdb_str: str, bfactors: np.ndarray) -> str:
raise ValueError(
f'Invalid final dimension size for bfactors: {bfactors.shape[-1]}.')
parser = PDB.PDBParser()
parser = PDB.PDBParser(QUIET=True)
handle = io.StringIO(pdb_str)
structure = parser.get_structure('', handle)
......
......@@ -54,7 +54,8 @@ RUN conda update -qy conda \
openmm=7.5.1 \
cudatoolkit==${CUDA}.3 \
pdbfixer \
pip
pip \
python=3.7
COPY . /app/alphafold
RUN wget -q -P /app/alphafold/alphafold/common/ \
......@@ -67,7 +68,7 @@ RUN pip3 install --upgrade pip \
https://storage.googleapis.com/jax-releases/jax_releases.html
# Apply OpenMM patch.
WORKDIR /opt/conda/lib/python3.8/site-packages
WORKDIR /opt/conda/lib/python3.7/site-packages
RUN patch -p0 < /app/alphafold/docker/openmm.patch
# We need to run `ldconfig` first to ensure GPUs are visible, due to some quirk
......
......@@ -57,13 +57,17 @@ uniref90_database_path = os.path.join(
# Path to the MGnify database for use by JackHMMER.
mgnify_database_path = os.path.join(
DOWNLOAD_DIR, 'mgnify', 'mgy_clusters.fa')
DOWNLOAD_DIR, 'mgnify', 'mgy_clusters_2018_08.fa')
# Path to the BFD database for use by HHblits.
bfd_database_path = os.path.join(
DOWNLOAD_DIR, 'bfd',
'bfd_metaclust_clu_complete_id30_c90_final_seq.sorted_opt')
# Path to the Small BFD database for use by JackHMMER.
small_bfd_database_path = os.path.join(
DOWNLOAD_DIR, 'small_bfd', 'bfd-first_non_consensus_sequences.fasta')
# Path to the Uniclust30 database for use by HHblits.
uniclust30_database_path = os.path.join(
DOWNLOAD_DIR, 'uniclust30', 'uniclust30_2018_08', 'uniclust30_2018_08')
......@@ -92,10 +96,11 @@ flags.DEFINE_string('max_template_date', None, 'Maximum template release date '
'to consider (ISO-8601 format - i.e. YYYY-MM-DD). '
'Important if folding historical test sets.')
flags.DEFINE_enum('preset', 'full_dbs',
['full_dbs', 'casp14'],
'Choose preset model configuration - no ensembling with '
'uniref90 + bfd + uniclust30 (full_dbs), or '
'8 model ensemblings with uniref90 + bfd + uniclust30 '
['reduced_dbs', 'full_dbs', 'casp14'],
'Choose preset model configuration - no ensembling and '
'smaller genetic database config (reduced_dbs), no '
'ensembling and full genetic database config (full_dbs) or '
'full genetic database config and 8 model ensemblings '
'(casp14).')
flags.DEFINE_boolean('benchmark', False, 'Run multiple JAX model evaluations '
'to obtain a timing that excludes the compilation time, '
......@@ -131,14 +136,22 @@ def main(argv):
target_fasta_paths.append(target_path)
command_args.append(f'--fasta_paths={",".join(target_fasta_paths)}')
for name, path in [('uniref90_database_path', uniref90_database_path),
('mgnify_database_path', mgnify_database_path),
('uniclust30_database_path', uniclust30_database_path),
('bfd_database_path', bfd_database_path),
('pdb70_database_path', pdb70_database_path),
('data_dir', data_dir),
('template_mmcif_dir', template_mmcif_dir),
('obsolete_pdbs_path', obsolete_pdbs_path)]:
database_paths = [
('uniref90_database_path', uniref90_database_path),
('mgnify_database_path', mgnify_database_path),
('pdb70_database_path', pdb70_database_path),
('data_dir', data_dir),
('template_mmcif_dir', template_mmcif_dir),
('obsolete_pdbs_path', obsolete_pdbs_path),
]
if FLAGS.preset == 'reduced_dbs':
database_paths.append(('small_bfd_database_path', small_bfd_database_path))
else:
database_paths.extend([
('uniclust30_database_path', uniclust30_database_path),
('bfd_database_path', bfd_database_path),
])
for name, path in database_paths:
if path:
mount, target_path = _create_mount(name, path)
mounts.append(mount)
......