Merge tag 'v2.3.2'

f7451744 · zhuwenwen · 346a6479 · 3f317255 · f7451744 · f7451744
Commit f7451744 authored Nov 15, 2023 by zhuwenwen
20 changed files
--- a/.dockerignore
+++ b/.dockerignore
+.dockerignore
+docker/Dockerfile
--- a/CONTRIBUTING.md
+++ b/CONTRIBUTING.md
+# How to Contribute
+
+We welcome small patches related to bug fixes and documentation, but we do not
+plan to make any major changes to this repository.
+
+## Contributor License Agreement
+
+Contributions to this project must be accompanied by a Contributor License
+Agreement. You (or your employer) retain the copyright to your contribution,
+this simply gives us permission to use and redistribute your contributions as
+part of the project. Head over to <https://cla.developers.google.com/> to see
+your current agreements on file or to sign a new one.
+
+You generally only need to submit a CLA once, so if you've already submitted one
+(even if it was for a different project), you probably don't need to do it
+again.
+
+## Code reviews
+
+All submissions, including submissions by project members, require review. We
+use GitHub pull requests for this purpose. Consult
+[GitHub Help](https://help.github.com/articles/about-pull-requests/) for more
+information on using pull requests.
--- a/LICENSE
+++ b/LICENSE
+
+                                 Apache License
+                           Version 2.0, January 2004
+                        http://www.apache.org/licenses/
+
+   TERMS AND CONDITIONS FOR USE, REPRODUCTION, AND DISTRIBUTION
+
+   1. Definitions.
+
+      "License" shall mean the terms and conditions for use, reproduction,
+      and distribution as defined by Sections 1 through 9 of this document.
+
+      "Licensor" shall mean the copyright owner or entity authorized by
+      the copyright owner that is granting the License.
+
+      "Legal Entity" shall mean the union of the acting entity and all
+      other entities that control, are controlled by, or are under common
+      control with that entity. For the purposes of this definition,
+      "control" means (i) the power, direct or indirect, to cause the
+      direction or management of such entity, whether by contract or
+      otherwise, or (ii) ownership of fifty percent (50%) or more of the
+      outstanding shares, or (iii) beneficial ownership of such entity.
+
+      "You" (or "Your") shall mean an individual or Legal Entity
+      exercising permissions granted by this License.
+
+      "Source" form shall mean the preferred form for making modifications,
+      including but not limited to software source code, documentation
+      source, and configuration files.
+
+      "Object" form shall mean any form resulting from mechanical
+      transformation or translation of a Source form, including but
+      not limited to compiled object code, generated documentation,
+      and conversions to other media types.
+
+      "Work" shall mean the work of authorship, whether in Source or
+      Object form, made available under the License, as indicated by a
+      copyright notice that is included in or attached to the work
+      (an example is provided in the Appendix below).
+
+      "Derivative Works" shall mean any work, whether in Source or Object
+      form, that is based on (or derived from) the Work and for which the
+      editorial revisions, annotations, elaborations, or other modifications
+      represent, as a whole, an original work of authorship. For the purposes
+      of this License, Derivative Works shall not include works that remain
+      separable from, or merely link (or bind by name) to the interfaces of,
+      the Work and Derivative Works thereof.
+
+      "Contribution" shall mean any work of authorship, including
+      the original version of the Work and any modifications or additions
+      to that Work or Derivative Works thereof, that is intentionally
+      submitted to Licensor for inclusion in the Work by the copyright owner
+      or by an individual or Legal Entity authorized to submit on behalf of
+      the copyright owner. For the purposes of this definition, "submitted"
+      means any form of electronic, verbal, or written communication sent
+      to the Licensor or its representatives, including but not limited to
+      communication on electronic mailing lists, source code control systems,
+      and issue tracking systems that are managed by, or on behalf of, the
+      Licensor for the purpose of discussing and improving the Work, but
+      excluding communication that is conspicuously marked or otherwise
+      designated in writing by the copyright owner as "Not a Contribution."
+
+      "Contributor" shall mean Licensor and any individual or Legal Entity
+      on behalf of whom a Contribution has been received by Licensor and
+      subsequently incorporated within the Work.
+
+   2. Grant of Copyright License. Subject to the terms and conditions of
+      this License, each Contributor hereby grants to You a perpetual,
+      worldwide, non-exclusive, no-charge, royalty-free, irrevocable
+      copyright license to reproduce, prepare Derivative Works of,
+      publicly display, publicly perform, sublicense, and distribute the
+      Work and such Derivative Works in Source or Object form.
+
+   3. Grant of Patent License. Subject to the terms and conditions of
+      this License, each Contributor hereby grants to You a perpetual,
+      worldwide, non-exclusive, no-charge, royalty-free, irrevocable
+      (except as stated in this section) patent license to make, have made,
+      use, offer to sell, sell, import, and otherwise transfer the Work,
+      where such license applies only to those patent claims licensable
+      by such Contributor that are necessarily infringed by their
+      Contribution(s) alone or by combination of their Contribution(s)
+      with the Work to which such Contribution(s) was submitted. If You
+      institute patent litigation against any entity (including a
+      cross-claim or counterclaim in a lawsuit) alleging that the Work
+      or a Contribution incorporated within the Work constitutes direct
+      or contributory patent infringement, then any patent licenses
+      granted to You under this License for that Work shall terminate
+      as of the date such litigation is filed.
+
+   4. Redistribution. You may reproduce and distribute copies of the
+      Work or Derivative Works thereof in any medium, with or without
+      modifications, and in Source or Object form, provided that You
+      meet the following conditions:
+
+      (a) You must give any other recipients of the Work or
+          Derivative Works a copy of this License; and
+
+      (b) You must cause any modified files to carry prominent notices
+          stating that You changed the files; and
+
+      (c) You must retain, in the Source form of any Derivative Works
+          that You distribute, all copyright, patent, trademark, and
+          attribution notices from the Source form of the Work,
+          excluding those notices that do not pertain to any part of
+          the Derivative Works; and
+
+      (d) If the Work includes a "NOTICE" text file as part of its
+          distribution, then any Derivative Works that You distribute must
+          include a readable copy of the attribution notices contained
+          within such NOTICE file, excluding those notices that do not
+          pertain to any part of the Derivative Works, in at least one
+          of the following places: within a NOTICE text file distributed
+          as part of the Derivative Works; within the Source form or
+          documentation, if provided along with the Derivative Works; or,
+          within a display generated by the Derivative Works, if and
+          wherever such third-party notices normally appear. The contents
+          of the NOTICE file are for informational purposes only and
+          do not modify the License. You may add Your own attribution
+          notices within Derivative Works that You distribute, alongside
+          or as an addendum to the NOTICE text from the Work, provided
+          that such additional attribution notices cannot be construed
+          as modifying the License.
+
+      You may add Your own copyright statement to Your modifications and
+      may provide additional or different license terms and conditions
+      for use, reproduction, or distribution of Your modifications, or
+      for any such Derivative Works as a whole, provided Your use,
+      reproduction, and distribution of the Work otherwise complies with
+      the conditions stated in this License.
+
+   5. Submission of Contributions. Unless You explicitly state otherwise,
+      any Contribution intentionally submitted for inclusion in the Work
+      by You to the Licensor shall be under the terms and conditions of
+      this License, without any additional terms or conditions.
+      Notwithstanding the above, nothing herein shall supersede or modify
+      the terms of any separate license agreement you may have executed
+      with Licensor regarding such Contributions.
+
+   6. Trademarks. This License does not grant permission to use the trade
+      names, trademarks, service marks, or product names of the Licensor,
+      except as required for reasonable and customary use in describing the
+      origin of the Work and reproducing the content of the NOTICE file.
+
+   7. Disclaimer of Warranty. Unless required by applicable law or
+      agreed to in writing, Licensor provides the Work (and each
+      Contributor provides its Contributions) on an "AS IS" BASIS,
+      WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or
+      implied, including, without limitation, any warranties or conditions
+      of TITLE, NON-INFRINGEMENT, MERCHANTABILITY, or FITNESS FOR A
+      PARTICULAR PURPOSE. You are solely responsible for determining the
+      appropriateness of using or redistributing the Work and assume any
+      risks associated with Your exercise of permissions under this License.
+
+   8. Limitation of Liability. In no event and under no legal theory,
+      whether in tort (including negligence), contract, or otherwise,
+      unless required by applicable law (such as deliberate and grossly
+      negligent acts) or agreed to in writing, shall any Contributor be
+      liable to You for damages, including any direct, indirect, special,
+      incidental, or consequential damages of any character arising as a
+      result of this License or out of the use or inability to use the
+      Work (including but not limited to damages for loss of goodwill,
+      work stoppage, computer failure or malfunction, or any and all
+      other commercial damages or losses), even if such Contributor
+      has been advised of the possibility of such damages.
+
+   9. Accepting Warranty or Additional Liability. While redistributing
+      the Work or Derivative Works thereof, You may choose to offer,
+      and charge a fee for, acceptance of support, warranty, indemnity,
+      or other liability obligations and/or rights consistent with this
+      License. However, in accepting such obligations, You may act only
+      on Your own behalf and on Your sole responsibility, not on behalf
+      of any other Contributor, and only if You agree to indemnify,
+      defend, and hold each Contributor harmless for any liability
+      incurred by, or claims asserted against, such Contributor by reason
+      of your accepting any such warranty or additional liability.
+
+   END OF TERMS AND CONDITIONS
+
+   APPENDIX: How to apply the Apache License to your work.
+
+      To apply the Apache License to your work, attach the following
+      boilerplate notice, with the fields enclosed by brackets "[]"
+      replaced with your own identifying information. (Don't include
+      the brackets!)  The text should be enclosed in the appropriate
+      comment syntax for the file format. We also recommend that a
+      file or class name and description of purpose be included on the
+      same "printed page" as the copyright notice for easier
+      identification within third-party archives.
+
+   Copyright [yyyy] [name of copyright owner]
+
+   Licensed under the Apache License, Version 2.0 (the "License");
+   you may not use this file except in compliance with the License.
+   You may obtain a copy of the License at
+
+       http://www.apache.org/licenses/LICENSE-2.0
+
+   Unless required by applicable law or agreed to in writing, software
+   distributed under the License is distributed on an "AS IS" BASIS,
+   WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+   See the License for the specific language governing permissions and
+   limitations under the License.
--- a/README.md
+++ b/README.md
--- a/afdb/README.md
+++ b/afdb/README.md
+# AlphaFold Protein Structure Database
+
+## Introduction
+
+The AlphaFold UniProt release (214M predictions) is hosted on
+[Google Cloud Public Datasets](https://console.cloud.google.com/marketplace/product/bigquery-public-data/deepmind-alphafold),
+and is available to download at no cost under a
+[CC-BY-4.0 licence](http://creativecommons.org/licenses/by/4.0/legalcode). The
+dataset is in a Cloud Storage bucket, and metadata is available on BigQuery. A
+Google Cloud account is required for the download, but the data can be freely
+used under the terms of the
+[CC-BY 4.0 Licence](http://creativecommons.org/licenses/by/4.0/legalcode).
+
+This document provides an overview of how to access and download the dataset for
+different use cases. Please refer to the [AlphaFold database FAQ](https://www.alphafold.com/faq)
+for further information on what proteins are in the database and a changelog of
+releases.
+
+:ledger: **Note: The full dataset is difficult to manipulate without significant
+computational resources (the size of the dataset is 23 TiB, 3 * 214M files).**
+
+There are also alternatives to downloading the full dataset:
+
+1.  Download a premade subset (covering important species / Swiss-Prot) via our
+    [download page](https://alphafold.ebi.ac.uk/download).
+2.  Download a custom subset of the data. See below.
+
+If you need to download the full dataset then please see the "Bulk download"
+section. See "Creating a Google Cloud Account" below for more information on how
+to avoid any surprise costs when using Google Cloud Public Datasets.
+
+## Licence
+
+Data is available for academic and commercial use, under a
+[CC-BY-4.0 licence](http://creativecommons.org/licenses/by/4.0/legalcode).
+
+EMBL-EBI expects attribution (e.g. in publications, services or products) for
+any of its online services, databases or software in accordance with good
+scientific practice.
+
+If you make use of an AlphaFold prediction, please cite the following papers:
+
+*   [Jumper, J et al. Highly accurate protein structure prediction with
+    AlphaFold. Nature
+    (2021).](https://www.nature.com/articles/s41586-021-03819-2)
+*   [Varadi, M et al. AlphaFold Protein Structure Database: massively expanding
+    the structural coverage of protein-sequence space with high-accuracy models.
+    Nucleic Acids Research
+    (2021).](https://academic.oup.com/nar/advance-article/doi/10.1093/nar/gkab1061/6430488)
+
+AlphaFold Data Copyright (2022) DeepMind Technologies Limited.
+
+## Disclaimer
+
+The AlphaFold Data and other information provided on this site is for
+theoretical modelling only, caution should be exercised in its use. It is
+provided 'as-is' without any warranty of any kind, whether expressed or implied.
+For clarity, no warranty is given that use of the information shall not infringe
+the rights of any third party. The information is not intended to be a
+substitute for professional medical advice, diagnosis, or treatment, and does
+not constitute medical or other professional advice.
+
+## Format
+
+Dataset file names start with a protein identifier of the form `AF-[a UniProt
+accession]-F[a fragment number]`.
+
+Three files are provided for each entry:
+
+*   **model_v4.cif** – contains the atomic coordinates for the predicted protein
+    structure, along with some metadata. Useful references for this file format
+    are the [ModelCIF](https://github.com/ihmwg/ModelCIF) and
+    [PDBx/mmCIF](https://mmcif.wwpdb.org) project sites.
+*   **confidence_v4.json** – contains a confidence metric output by AlphaFold
+    called pLDDT. This provides a number for each residue, indicating how
+    confident AlphaFold is in the *local* surrounding structure. pLDDT ranges
+    from 0 to 100, where 100 is most confident. This is also contained in the
+    CIF file.
+*   **predicted_aligned_error_v4.json** – contains a confidence metric output by
+    AlphaFold called PAE. This provides a number for every pair of residues,
+    which is lower when AlphaFold is more confident in the relative position of
+    the two residues. PAE is more suitable than pLDDT for judging confidence in 
+    relative domain placements.
+    [See here](https://alphafold.ebi.ac.uk/faq#faq-7) for a description of the
+    format.
+
+Predictions grouped by NCBI taxonomy ID are available as
+`proteomes/proteome-tax_id-[TAX ID]-[SHARD ID]_v4.tar` within the same
+bucket.
+
+There are also two extra files stored in the bucket:
+
+*   `accession_ids.csv` – This file contains a list of all the UniProt
+    accessions that have predictions in AlphaFold DB. The file is in CSV format
+    and includes the following columns, separated by a comma:
+    *   UniProt accession, e.g. A8H2R3
+    *   First residue index (UniProt numbering), e.g. 1
+    *   Last residue index (UniProt numbering), e.g. 199
+    *   AlphaFold DB identifier, e.g. AF-A8H2R3-F1
+    *   Latest version, e.g. 4
+*   `sequences.fasta` – This file contains sequences for all proteins in the
+    current database version in FASTA format. The identifier rows start with
+    ">AFDB", followed by the AlphaFold DB identifier and the name of the
+    protein. The sequence rows contain the corresponding amino acid sequences.
+    Each sequence is on a single line, i.e. there is no wrapping.
+
+## Creating a Google Cloud Account
+
+Downloading from the Google Cloud Public Datasets (rather than from AFDB or 3D
+Beacons) requires a Google Cloud account. See the
+[Google Cloud get started](https://cloud.google.com/docs/get-started) page, and
+explore the [free tier account usage limits](https://cloud.google.com/free).
+
+**IMPORTANT: After the trial period has finished (90 days), to continue access,
+you are required to upgrade to a billing account. While your free tier access
+(including access to the Public Datasets storage bucket) continues, usage beyond
+the free tier will incur costs – please familiarise yourself with the pricing
+for the services that you use to avoid any surprises.**
+
+1.  Go to
+    [https://cloud.google.com/datasets](https://cloud.google.com/datasets).
+2.  Create an account:
+    1.  Click "get started for free" in the top right corner.
+    2.  Agree to all terms of service.
+    3.  Follow the setup instructions. Note that a payment method is required,
+        but this will not be used unless you enable billing.
+    4.  Access to the Google Cloud Public Datasets storage bucket is always at
+        no cost and you will have access to the
+        [free tier.](https://cloud.google.com/free/docs/gcp-free-tier#free-tier-usage-limits)
+3.  Set up a project:
+    1.  In the top left corner, click the navigation menu (three horizontal bars
+        icon).
+    2.  Select: "Cloud overview" -> "Dashboard".
+    3.  In the top left corner there is a project menu bar (likely says "My
+        First Project"). Select this and a "Select a Project" box will appear.
+    4.  To keep using this project, click "Cancel" at the bottom of the box.
+    5.  To create a new project, click "New Project" at the top of the box:
+        1.  Select a project name.
+        2.  For location, if your organization has a Cloud account then select
+            this, otherwise leave as is.
+4.  Install `gsutil`:
+    1.  Follow these
+        [instructions](https://cloud.google.com/storage/docs/gsutil_install).
+
+## Accessing the dataset
+
+The data is available from:
+
+*   GCS data bucket:
+    [gs://public-datasets-deepmind-alphafold-v4](https://console.cloud.google.com/storage/browser/public-datasets-deepmind-alphafold-v4)
+
+## Bulk download
+
+We don't recommend downloading the full dataset unless required for processing
+with local computational resources, for example in an academic high performance
+computing centre.
+
+We estimate that a 1 Gbps internet connection will allow download of the full
+database in roughly 2.5 days.
+
+While we don’t know the exact nature of your computational infrastructure, below
+are some suggested approaches for downloading the dataset. Please reach out to
+[alphafold@deepmind.com](mailto:alphafold@deepmind.com) if you have any
+questions.
+
+The recommended way of downloading the whole database is by downloading
+1,015,797 sharded proteome tar files using the command below. This is
+significantly faster than downloading all of the individual files because of
+large constant per-file latency.
+
+```bash
+gsutil -m cp -r gs://public-datasets-deepmind-alphafold-v4/proteomes/ .
+```
+
+You will then have to un-tar all of the proteomes and un-gzip all of the
+individual files. Note that after un-taring, there will be about 644M files, so
+make sure your filesystem can handle this.
+
+### Storage Transfer Service
+
+Some users might find the
+[Storage Transfer Service](https://cloud.google.com/storage-transfer-service) a
+convenient way to set up the transfer between this bucket and another bucket, or
+another cloud service. *Using this service may incur costs*. Please check the
+[pricing page](https://cloud.google.com/storage-transfer/pricing) for more
+detail, particularly for transfers to other cloud services.
+
+## Downloading subsets of the data
+
+### AlphaFold Database search
+
+For simple queries, for example by protein name, gene name or UniProt accession
+you can use the main search bar on
+[alphafold.ebi.ac.uk](https://alphafold.ebi.ac.uk).
+
+### 3D Beacons
+
+[3D-Beacons](https://3d-beacons.org) is an international collaboration of
+protein structure data providers to create a federated network with unified data
+access mechanisms. The 3D-Beacons platform allows users to retrieve coordinate
+files and metadata of experimentally determined and theoretical protein models
+from data providers such as AlphaFold DB.
+
+More information about how to access AlphaFold predictions using 3D-Beacons is
+available at
+[3D-Beacons documentation](https://www.ebi.ac.uk/pdbe/pdbe-kb/3dbeacons/docs).
+
+### Other premade species subsets
+
+Downloads for some model organism proteomes, global health proteomes and
+Swiss-Prot are available on the
+[AFDB website](https://alphafold.ebi.ac.uk/download). These are generated from
+[reference proteomes](https://www.uniprot.org/help/reference_proteome). If you
+want other species, or *all* proteins for a particular species, please continue
+reading.
+
+We provide 1,015,797 sharded tar files for all species in
+[gs://public-datasets-deepmind-alphafold-v4/proteomes/](https://console.cloud.google.com/storage/browser/public-datasets-deepmind-alphafold-v4/proteomes/).
+We shard each proteome so that each shard contains at most 10,000 proteins
+(which corresponds to 30,000 files per shard, since there are 3 files per
+protein). To download a proteome of your choice, you have to do the following
+steps:
+
+1.  Find the [NCBI taxonomy ID](https://www.ncbi.nlm.nih.gov/taxonomy)
+    (`[TAX_ID]`) of the species in question.
+2.  Run `gsutil -m cp
+    gs://public-datasets-deepmind-alphafold-v4/proteomes/proteome-tax_id-[TAX
+    ID]-*_v4.tar .` to download all shards for this proteome.
+3.  Un-tar all of the downloaded files and un-gzip all of the individual files.
+
+### File manifests
+
+Pre-made lists of files (manifests) are available at
+[gs://public-datasets-deepmind-alphafold-v4/manifests](https://console.cloud.google.com/storage/browser/public-datasets-deepmind-alphafold-v4/manifests/).
+Note that these filenames do not include the bucket prefix, but this can be
+added once the files have been downloaded to your filesystem.
+
+One can also define their own list of files, for example created by BigQuery
+(see below). `gsutil` can be used to download these files with
+
+```bash
+cat [manifest file] | gsutil -m cp -I .
+```
+
+This will be much slower than downloading the tar files (grouped by species)
+because each file has an associated overhead.
+
+### BigQuery
+
+**IMPORTANT: The
+[free tier](https://cloud.google.com/bigquery/pricing#free-tier) of Google Cloud
+comes with [BigQuery Sandbox](https://cloud.google.com/bigquery/docs/sandbox)
+with 1 TB of free processed query data each month. Repeated queries within a
+month could exceed this limit and if you have
+[upgraded to a paid Cloud Billing account](https://cloud.google.com/free/docs/gcp-free-tier#how-to-upgrade)
+you may be charged.**
+
+**This should be sufficient for running a number of queries on the metadata
+table, though the usage depends on the size of the columns queried and selected.
+Please look at the
+[BigQuery pricing page](https://cloud.google.com/bigquery/pricing) for more
+information.**
+
+**This is the user's responsibility so please ensure you keep track of your
+billing settings and resource usage in the console.**
+
+BigQuery provides a serverless and highly scalable analytics tool enabling SQL
+queries over large datasets. The metadata for the UniProt dataset takes up 113
+GiB and so can be challenging to process and analyse locally. The table name is:
+
+*   BigQuery metadata table:
+    [bigquery-public-data.deepmind_alphafold.metadata](https://console.cloud.google.com/bigquery?project=bigquery-public-data&ws=!1m5!1m4!4m3!1sbigquery-public-data!2sdeepmind_alphafold!3smetadata)
+
+With BigQuery SQL you can do complex queries, e.g. find all high accuracy
+predictions for a particular species, or even join on to other datasets, e.g. to
+an experimental dataset by the `uniprotSequence`, or to the NCBI taxonomy by
+`taxId`.
+
+If you would find additional information in the metadata useful please file a
+GitHub issue.
+
+#### Setup
+
+Follow the
+[BigQuery Sandbox set up guide](https://cloud.google.com/bigquery/docs/sandbox).
+
+#### Exploring the metadata
+
+The column names and associated data types available can be found using the
+following query.
+
+```sql
+SELECT column_name, data_type FROM bigquery-public-data.deepmind_alphafold.INFORMATION_SCHEMA.COLUMNS
+WHERE table_name = 'metadata'
+```
+
+**Column name**        | **Data type**   | **Description**
+---------------------- | --------------- | ---------------
+allVersions            | `ARRAY<INT64>`  | An array of AFDB versions this prediction has had
+entryId                | `STRING`        | The AFDB entry ID, e.g. "AF-Q1HGU3-F1"
+fractionPlddtConfident | `FLOAT64`       | Fraction of the residues in the prediction with pLDDT between 70 and 90
+fractionPlddtLow       | `FLOAT64`       | Fraction of the residues in the prediction with pLDDT between 50 and 70
+fractionPlddtVeryHigh  | `FLOAT64`       | Fraction of the residues in the prediction with pLDDT greater than 90
+fractionPlddtVeryLow   | `FLOAT64`       | Fraction of the residues in the prediction with pLDDT less than 50
+gene                   | `STRING`        | The name of the gene if known, e.g. "COII"
+geneSynonyms           | `ARRAY<STRING>` | Additional synonyms for the gene
+globalMetricValue      | `FLOAT64`       | The mean pLDDT of this prediction
+isReferenceProteome    | `BOOL`          | Is this protein part of the reference proteome?
+isReviewed             | `BOOL`          | Has this protein been reviewed, i.e. is it part of SwissProt?
+latestVersion          | `INT64`         | The latest AFDB version for this prediction
+modelCreatedDate       | `DATE`          | The date of creation for this entry, e.g. "2022-06-01"
+organismCommonNames    | `ARRAY<STRING>` | List of common organism names
+organismScientificName | `STRING`        | The scientific name of the organism
+organismSynonyms       | `ARRAY<STRING>` | List of synonyms for the organism
+proteinFullNames       | `ARRAY<STRING>` | Full names of the protein
+proteinShortNames      | `ARRAY<STRING>` | Short names of the protein
+sequenceChecksum       | `STRING`        | [CRC64 hash](https://www.uniprot.org/help/checksum) of the sequence. Can be used for cheaper lookups.
+sequenceVersionDate    | `DATE`          | Date when the sequence data was last modified in UniProt
+taxId                  | `INT64`         | NCBI taxonomy id of the originating species
+uniprotAccession       | `STRING`        | Uniprot accession ID
+uniprotDescription     | `STRING`        | The name recommended by the UniProt consortium
+uniprotEnd             | `INT64`         | Number of the last residue in the entry relative to the UniProt entry. This is equal to the length of the protein unless we are dealing with protein fragments.
+uniprotId              | `STRING`        | The Uniprot EntryName field
+uniprotSequence        | `STRING`        | Amino acid sequence for this prediction
+uniprotStart           | `INT64`         | Number of the first residue in the entry relative to the UniProt entry. This is 1 unless we are dealing with protein fragments.
+
+#### Producing summary statistics
+
+The following query gives the mean of the prediction confidence fractions per
+species.
+
+```sql
+SELECT
+ organismScientificName AS name,
+ SUM(fractionPlddtVeryLow) / COUNT(fractionPlddtVeryLow) AS mean_plddt_very_low,
+ SUM(fractionPlddtLow) / COUNT(fractionPlddtLow) AS mean_plddt_low,
+ SUM(fractionPlddtConfident) / COUNT(fractionPlddtConfident) AS mean_plddt_confident,
+ SUM(fractionPlddtVeryHigh) / COUNT(fractionPlddtVeryHigh) AS mean_plddt_very_high,
+ COUNT(organismScientificName) AS num_predictions
+FROM bigquery-public-data.deepmind_alphafold.metadata
+GROUP by name
+ORDER BY num_predictions DESC;
+```
+
+#### Producing lists of files
+
+We expect that the most important use for the metadata will be to create subsets
+of proteins according to various criteria, so that users can choose to only copy
+a subset of the 214M proteins that exist in the dataset. An example query is
+given below:
+
+```sql
+with file_rows AS (
+  with file_cols AS (
+    SELECT
+      CONCAT(entryID, '-model_v4.cif') as m,
+      CONCAT(entryID, '-predicted_aligned_error_v4.json') as p
+    FROM bigquery-public-data.deepmind_alphafold.metadata
+    WHERE organismScientificName = "Homo sapiens"
+      AND (fractionPlddtVeryHigh + fractionPlddtConfident) > 0.5
+  )
+  SELECT * FROM file_cols UNPIVOT (files for filetype in (m, p))
+)
+SELECT CONCAT('gs://public-datasets-deepmind-alphafold-v4/', files) as files
+from file_rows
+```
+
+In this case, the list has been filtered to only include proteins from *Homo
+sapiens* for which over half the residues are confident or better (>70 pLDDT).
+
+This creates a table with one column "files", where each row is the cloud
+location of one of the two file types that has been provided for each protein.
+There is an additional `confidence_v4.json` file which contains the
+per-residue pLDDT. This information is already in the CIF file but may be
+preferred if only this information is required.
+
+This allows users to bulk download the exact proteins they need, without having
+to download the entire dataset. Other columns may also be used to select subsets
+of proteins, and we point the user to the
+[BigQuery documentation](https://cloud.google.com/bigquery/docs) to understand
+other ways to filter for their desired protein lists. Likewise, the
+documentation should be followed to download these file subsets locally, as the
+most appropriate approach will depend on the filesize. Note that it may be
+easier to download large files using [Colab](https://colab.research.google.com/)
+(e.g. pandas to_csv).
+
+#### Previous versions
+Previous versions of AFDB will remain available at
+[gs://public-datasets-deepmind-alphafold](https://console.cloud.google.com/storage/browser/public-datasets-deepmind-alphafold)
+to enable reproducible research. We recommend using the latest version (v4).
\ No newline at end of file
--- a/alphafold/__init__.py
+++ b/alphafold/__init__.py
+# Copyright 2021 DeepMind Technologies Limited
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+#      http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+"""An implementation of the inference pipeline of AlphaFold v2.0."""
--- a/alphafold/common/__init__.py
+++ b/alphafold/common/__init__.py
+# Copyright 2021 DeepMind Technologies Limited
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+#      http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+"""Common data types and constants used within Alphafold."""
--- a/alphafold/common/confidence.py
+++ b/alphafold/common/confidence.py
+# Copyright 2021 DeepMind Technologies Limited
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+#      http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+
+"""Functions for processing confidence metrics."""
+
+from typing import Dict, Optional, Tuple
+import numpy as np
+import scipy.special
+
+
+def compute_plddt(logits: np.ndarray) -> np.ndarray:
+  """Computes per-residue pLDDT from logits.
+
+  Args:
+    logits: [num_res, num_bins] output from the PredictedLDDTHead.
+
+  Returns:
+    plddt: [num_res] per-residue pLDDT.
+  """
+  num_bins = logits.shape[-1]
+  bin_width = 1.0 / num_bins
+  bin_centers = np.arange(start=0.5 * bin_width, stop=1.0, step=bin_width)
+  probs = scipy.special.softmax(logits, axis=-1)
+  predicted_lddt_ca = np.sum(probs * bin_centers[None, :], axis=-1)
+  return predicted_lddt_ca * 100
+
+
+def _calculate_bin_centers(breaks: np.ndarray):
+  """Gets the bin centers from the bin edges.
+
+  Args:
+    breaks: [num_bins - 1] the error bin edges.
+
+  Returns:
+    bin_centers: [num_bins] the error bin centers.
+  """
+  step = (breaks[1] - breaks[0])
+
+  # Add half-step to get the center
+  bin_centers = breaks + step / 2
+  # Add a catch-all bin at the end.
+  bin_centers = np.concatenate([bin_centers, [bin_centers[-1] + step]],
+                               axis=0)
+  return bin_centers
+
+
+def _calculate_expected_aligned_error(
+    alignment_confidence_breaks: np.ndarray,
+    aligned_distance_error_probs: np.ndarray) -> Tuple[np.ndarray, np.ndarray]:
+  """Calculates expected aligned distance errors for every pair of residues.
+
+  Args:
+    alignment_confidence_breaks: [num_bins - 1] the error bin edges.
+    aligned_distance_error_probs: [num_res, num_res, num_bins] the predicted
+      probs for each error bin, for each pair of residues.
+
+  Returns:
+    predicted_aligned_error: [num_res, num_res] the expected aligned distance
+      error for each pair of residues.
+    max_predicted_aligned_error: The maximum predicted error possible.
+  """
+  bin_centers = _calculate_bin_centers(alignment_confidence_breaks)
+
+  # Tuple of expected aligned distance error and max possible error.
+  return (np.sum(aligned_distance_error_probs * bin_centers, axis=-1),
+          np.asarray(bin_centers[-1]))
+
+
+def compute_predicted_aligned_error(
+    logits: np.ndarray,
+    breaks: np.ndarray) -> Dict[str, np.ndarray]:
+  """Computes aligned confidence metrics from logits.
+
+  Args:
+    logits: [num_res, num_res, num_bins] the logits output from
+      PredictedAlignedErrorHead.
+    breaks: [num_bins - 1] the error bin edges.
+
+  Returns:
+    aligned_confidence_probs: [num_res, num_res, num_bins] the predicted
+      aligned error probabilities over bins for each residue pair.
+    predicted_aligned_error: [num_res, num_res] the expected aligned distance
+      error for each pair of residues.
+    max_predicted_aligned_error: The maximum predicted error possible.
+  """
+  aligned_confidence_probs = scipy.special.softmax(
+      logits,
+      axis=-1)
+  predicted_aligned_error, max_predicted_aligned_error = (
+      _calculate_expected_aligned_error(
+          alignment_confidence_breaks=breaks,
+          aligned_distance_error_probs=aligned_confidence_probs))
+  return {
+      'aligned_confidence_probs': aligned_confidence_probs,
+      'predicted_aligned_error': predicted_aligned_error,
+      'max_predicted_aligned_error': max_predicted_aligned_error,
+  }
+
+
+def predicted_tm_score(
+    logits: np.ndarray,
+    breaks: np.ndarray,
+    residue_weights: Optional[np.ndarray] = None,
+    asym_id: Optional[np.ndarray] = None,
+    interface: bool = False) -> np.ndarray:
+  """Computes predicted TM alignment or predicted interface TM alignment score.
+
+  Args:
+    logits: [num_res, num_res, num_bins] the logits output from
+      PredictedAlignedErrorHead.
+    breaks: [num_bins] the error bins.
+    residue_weights: [num_res] the per residue weights to use for the
+      expectation.
+    asym_id: [num_res] the asymmetric unit ID - the chain ID. Only needed for
+      ipTM calculation, i.e. when interface=True.
+    interface: If True, interface predicted TM score is computed.
+
+  Returns:
+    ptm_score: The predicted TM alignment or the predicted iTM score.
+  """
+
+  # residue_weights has to be in [0, 1], but can be floating-point, i.e. the
+  # exp. resolved head's probability.
+  if residue_weights is None:
+    residue_weights = np.ones(logits.shape[0])
+
+  bin_centers = _calculate_bin_centers(breaks)
+
+  num_res = int(np.sum(residue_weights))
+  # Clip num_res to avoid negative/undefined d0.
+  clipped_num_res = max(num_res, 19)
+
+  # Compute d_0(num_res) as defined by TM-score, eqn. (5) in Yang & Skolnick
+  # "Scoring function for automated assessment of protein structure template
+  # quality", 2004: http://zhanglab.ccmb.med.umich.edu/papers/2004_3.pdf
+  d0 = 1.24 * (clipped_num_res - 15) ** (1./3) - 1.8
+
+  # Convert logits to probs.
+  probs = scipy.special.softmax(logits, axis=-1)
+
+  # TM-Score term for every bin.
+  tm_per_bin = 1. / (1 + np.square(bin_centers) / np.square(d0))
+  # E_distances tm(distance).
+  predicted_tm_term = np.sum(probs * tm_per_bin, axis=-1)
+
+  pair_mask = np.ones(shape=(num_res, num_res), dtype=bool)
+  if interface:
+    pair_mask *= asym_id[:, None] != asym_id[None, :]
+
+  predicted_tm_term *= pair_mask
+
+  pair_residue_weights = pair_mask * (
+      residue_weights[None, :] * residue_weights[:, None])
+  normed_residue_mask = pair_residue_weights / (1e-8 + np.sum(
+      pair_residue_weights, axis=-1, keepdims=True))
+  per_alignment = np.sum(predicted_tm_term * normed_residue_mask, axis=-1)
+  return np.asarray(per_alignment[(per_alignment * residue_weights).argmax()])
--- a/alphafold/common/protein.py
+++ b/alphafold/common/protein.py
+# Copyright 2021 DeepMind Technologies Limited
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+#      http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+
+"""Protein data type."""
+import dataclasses
+import io
+from typing import Any, Mapping, Optional
+from alphafold.common import residue_constants
+from Bio.PDB import PDBParser
+import numpy as np
+
+FeatureDict = Mapping[str, np.ndarray]
+ModelOutput = Mapping[str, Any]  # Is a nested dict.
+
+# Complete sequence of chain IDs supported by the PDB format.
+PDB_CHAIN_IDS = 'ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789'
+PDB_MAX_CHAINS = len(PDB_CHAIN_IDS)  # := 62.
+
+
+@dataclasses.dataclass(frozen=True)
+class Protein:
+  """Protein structure representation."""
+
+  # Cartesian coordinates of atoms in angstroms. The atom types correspond to
+  # residue_constants.atom_types, i.e. the first three are N, CA, CB.
+  atom_positions: np.ndarray  # [num_res, num_atom_type, 3]
+
+  # Amino-acid type for each residue represented as an integer between 0 and
+  # 20, where 20 is 'X'.
+  aatype: np.ndarray  # [num_res]
+
+  # Binary float mask to indicate presence of a particular atom. 1.0 if an atom
+  # is present and 0.0 if not. This should be used for loss masking.
+  atom_mask: np.ndarray  # [num_res, num_atom_type]
+
+  # Residue index as used in PDB. It is not necessarily continuous or 0-indexed.
+  residue_index: np.ndarray  # [num_res]
+
+  # 0-indexed number corresponding to the chain in the protein that this residue
+  # belongs to.
+  chain_index: np.ndarray  # [num_res]
+
+  # B-factors, or temperature factors, of each residue (in sq. angstroms units),
+  # representing the displacement of the residue from its ground truth mean
+  # value.
+  b_factors: np.ndarray  # [num_res, num_atom_type]
+
+  def __post_init__(self):
+    if len(np.unique(self.chain_index)) > PDB_MAX_CHAINS:
+      raise ValueError(
+          f'Cannot build an instance with more than {PDB_MAX_CHAINS} chains '
+          'because these cannot be written to PDB format.')
+
+
+def from_pdb_string(pdb_str: str, chain_id: Optional[str] = None) -> Protein:
+  """Takes a PDB string and constructs a Protein object.
+
+  WARNING: All non-standard residue types will be converted into UNK. All
+    non-standard atoms will be ignored.
+
+  Args:
+    pdb_str: The contents of the pdb file
+    chain_id: If chain_id is specified (e.g. A), then only that chain
+      is parsed. Otherwise all chains are parsed.
+
+  Returns:
+    A new `Protein` parsed from the pdb contents.
+  """
+  pdb_fh = io.StringIO(pdb_str)
+  parser = PDBParser(QUIET=True)
+  structure = parser.get_structure('none', pdb_fh)
+  models = list(structure.get_models())
+  if len(models) != 1:
+    raise ValueError(
+        f'Only single model PDBs are supported. Found {len(models)} models.')
+  model = models[0]
+
+  atom_positions = []
+  aatype = []
+  atom_mask = []
+  residue_index = []
+  chain_ids = []
+  b_factors = []
+
+  for chain in model:
+    if chain_id is not None and chain.id != chain_id:
+      continue
+    for res in chain:
+      if res.id[2] != ' ':
+        raise ValueError(
+            f'PDB contains an insertion code at chain {chain.id} and residue '
+            f'index {res.id[1]}. These are not supported.')
+      res_shortname = residue_constants.restype_3to1.get(res.resname, 'X')
+      restype_idx = residue_constants.restype_order.get(
+          res_shortname, residue_constants.restype_num)
+      pos = np.zeros((residue_constants.atom_type_num, 3))
+      mask = np.zeros((residue_constants.atom_type_num,))
+      res_b_factors = np.zeros((residue_constants.atom_type_num,))
+      for atom in res:
+        if atom.name not in residue_constants.atom_types:
+          continue
+        pos[residue_constants.atom_order[atom.name]] = atom.coord
+        mask[residue_constants.atom_order[atom.name]] = 1.
+        res_b_factors[residue_constants.atom_order[atom.name]] = atom.bfactor
+      if np.sum(mask) < 0.5:
+        # If no known atom positions are reported for the residue then skip it.
+        continue
+      aatype.append(restype_idx)
+      atom_positions.append(pos)
+      atom_mask.append(mask)
+      residue_index.append(res.id[1])
+      chain_ids.append(chain.id)
+      b_factors.append(res_b_factors)
+
+  # Chain IDs are usually characters so map these to ints.
+  unique_chain_ids = np.unique(chain_ids)
+  chain_id_mapping = {cid: n for n, cid in enumerate(unique_chain_ids)}
+  chain_index = np.array([chain_id_mapping[cid] for cid in chain_ids])
+
+  return Protein(
+      atom_positions=np.array(atom_positions),
+      atom_mask=np.array(atom_mask),
+      aatype=np.array(aatype),
+      residue_index=np.array(residue_index),
+      chain_index=chain_index,
+      b_factors=np.array(b_factors))
+
+
+def _chain_end(atom_index, end_resname, chain_name, residue_index) -> str:
+  chain_end = 'TER'
+  return (f'{chain_end:<6}{atom_index:>5}      {end_resname:>3} '
+          f'{chain_name:>1}{residue_index:>4}')
+
+
+def to_pdb(prot: Protein) -> str:
+  """Converts a `Protein` instance to a PDB string.
+
+  Args:
+    prot: The protein to convert to PDB.
+
+  Returns:
+    PDB string.
+  """
+  restypes = residue_constants.restypes + ['X']
+  res_1to3 = lambda r: residue_constants.restype_1to3.get(restypes[r], 'UNK')
+  atom_types = residue_constants.atom_types
+
+  pdb_lines = []
+
+  atom_mask = prot.atom_mask
+  aatype = prot.aatype
+  atom_positions = prot.atom_positions
+  residue_index = prot.residue_index.astype(np.int32)
+  chain_index = prot.chain_index.astype(np.int32)
+  b_factors = prot.b_factors
+
+  if np.any(aatype > residue_constants.restype_num):
+    raise ValueError('Invalid aatypes.')
+
+  # Construct a mapping from chain integer indices to chain ID strings.
+  chain_ids = {}
+  for i in np.unique(chain_index):  # np.unique gives sorted output.
+    if i >= PDB_MAX_CHAINS:
+      raise ValueError(
+          f'The PDB format supports at most {PDB_MAX_CHAINS} chains.')
+    chain_ids[i] = PDB_CHAIN_IDS[i]
+
+  pdb_lines.append('MODEL     1')
+  atom_index = 1
+  last_chain_index = chain_index[0]
+  # Add all atom sites.
+  for i in range(aatype.shape[0]):
+    # Close the previous chain if in a multichain PDB.
+    if last_chain_index != chain_index[i]:
+      pdb_lines.append(_chain_end(
+          atom_index, res_1to3(aatype[i - 1]), chain_ids[chain_index[i - 1]],
+          residue_index[i - 1]))
+      last_chain_index = chain_index[i]
+      atom_index += 1  # Atom index increases at the TER symbol.
+
+    res_name_3 = res_1to3(aatype[i])
+    for atom_name, pos, mask, b_factor in zip(
+        atom_types, atom_positions[i], atom_mask[i], b_factors[i]):
+      if mask < 0.5:
+        continue
+
+      record_type = 'ATOM'
+      name = atom_name if len(atom_name) == 4 else f' {atom_name}'
+      alt_loc = ''
+      insertion_code = ''
+      occupancy = 1.00
+      element = atom_name[0]  # Protein supports only C, N, O, S, this works.
+      charge = ''
+      # PDB is a columnar format, every space matters here!
+      atom_line = (f'{record_type:<6}{atom_index:>5} {name:<4}{alt_loc:>1}'
+                   f'{res_name_3:>3} {chain_ids[chain_index[i]]:>1}'
+                   f'{residue_index[i]:>4}{insertion_code:>1}   '
+                   f'{pos[0]:>8.3f}{pos[1]:>8.3f}{pos[2]:>8.3f}'
+                   f'{occupancy:>6.2f}{b_factor:>6.2f}          '
+                   f'{element:>2}{charge:>2}')
+      pdb_lines.append(atom_line)
+      atom_index += 1
+
+  # Close the final chain.
+  pdb_lines.append(_chain_end(atom_index, res_1to3(aatype[-1]),
+                              chain_ids[chain_index[-1]], residue_index[-1]))
+  pdb_lines.append('ENDMDL')
+  pdb_lines.append('END')
+
+  # Pad all lines to 80 characters.
+  pdb_lines = [line.ljust(80) for line in pdb_lines]
+  return '\n'.join(pdb_lines) + '\n'  # Add terminating newline.
+
+
+def ideal_atom_mask(prot: Protein) -> np.ndarray:
+  """Computes an ideal atom mask.
+
+  `Protein.atom_mask` typically is defined according to the atoms that are
+  reported in the PDB. This function computes a mask according to heavy atoms
+  that should be present in the given sequence of amino acids.
+
+  Args:
+    prot: `Protein` whose fields are `numpy.ndarray` objects.
+
+  Returns:
+    An ideal atom mask.
+  """
+  return residue_constants.STANDARD_ATOM_MASK[prot.aatype]
+
+
+def from_prediction(
+    features: FeatureDict,
+    result: ModelOutput,
+    b_factors: Optional[np.ndarray] = None,
+    remove_leading_feature_dimension: bool = True) -> Protein:
+  """Assembles a protein from a prediction.
+
+  Args:
+    features: Dictionary holding model inputs.
+    result: Dictionary holding model outputs.
+    b_factors: (Optional) B-factors to use for the protein.
+    remove_leading_feature_dimension: Whether to remove the leading dimension
+      of the `features` values.
+
+  Returns:
+    A protein instance.
+  """
+  fold_output = result['structure_module']
+
+  def _maybe_remove_leading_dim(arr: np.ndarray) -> np.ndarray:
+    return arr[0] if remove_leading_feature_dimension else arr
+
+  if 'asym_id' in features:
+    chain_index = _maybe_remove_leading_dim(features['asym_id'])
+  else:
+    chain_index = np.zeros_like(_maybe_remove_leading_dim(features['aatype']))
+
+  if b_factors is None:
+    b_factors = np.zeros_like(fold_output['final_atom_mask'])
+
+  return Protein(
+      aatype=_maybe_remove_leading_dim(features['aatype']),
+      atom_positions=fold_output['final_atom_positions'],
+      atom_mask=fold_output['final_atom_mask'],
+      residue_index=_maybe_remove_leading_dim(features['residue_index']) + 1,
+      chain_index=chain_index,
+      b_factors=b_factors)
--- a/alphafold/common/protein_test.py
+++ b/alphafold/common/protein_test.py
+# Copyright 2021 DeepMind Technologies Limited
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+#      http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+
+"""Tests for protein."""
+
+import os
+
+from absl.testing import absltest
+from absl.testing import parameterized
+from alphafold.common import protein
+from alphafold.common import residue_constants
+import numpy as np
+# Internal import (7716).
+
+TEST_DATA_DIR = 'alphafold/common/testdata/'
+
+
+class ProteinTest(parameterized.TestCase):
+
+  def _check_shapes(self, prot, num_res):
+    """Check that the processed shapes are correct."""
+    num_atoms = residue_constants.atom_type_num
+    self.assertEqual((num_res, num_atoms, 3), prot.atom_positions.shape)
+    self.assertEqual((num_res,), prot.aatype.shape)
+    self.assertEqual((num_res, num_atoms), prot.atom_mask.shape)
+    self.assertEqual((num_res,), prot.residue_index.shape)
+    self.assertEqual((num_res,), prot.chain_index.shape)
+    self.assertEqual((num_res, num_atoms), prot.b_factors.shape)
+
+  @parameterized.named_parameters(
+      dict(testcase_name='chain_A',
+           pdb_file='2rbg.pdb', chain_id='A', num_res=282, num_chains=1),
+      dict(testcase_name='chain_B',
+           pdb_file='2rbg.pdb', chain_id='B', num_res=282, num_chains=1),
+      dict(testcase_name='multichain',
+           pdb_file='2rbg.pdb', chain_id=None, num_res=564, num_chains=2))
+  def test_from_pdb_str(self, pdb_file, chain_id, num_res, num_chains):
+    pdb_file = os.path.join(absltest.get_default_test_srcdir(), TEST_DATA_DIR,
+                            pdb_file)
+    with open(pdb_file) as f:
+      pdb_string = f.read()
+    prot = protein.from_pdb_string(pdb_string, chain_id)
+    self._check_shapes(prot, num_res)
+    self.assertGreaterEqual(prot.aatype.min(), 0)
+    # Allow equal since unknown restypes have index equal to restype_num.
+    self.assertLessEqual(prot.aatype.max(), residue_constants.restype_num)
+    self.assertLen(np.unique(prot.chain_index), num_chains)
+
+  def test_to_pdb(self):
+    with open(
+        os.path.join(absltest.get_default_test_srcdir(), TEST_DATA_DIR,
+                     '2rbg.pdb')) as f:
+      pdb_string = f.read()
+    prot = protein.from_pdb_string(pdb_string)
+    pdb_string_reconstr = protein.to_pdb(prot)
+
+    for line in pdb_string_reconstr.splitlines():
+      self.assertLen(line, 80)
+
+    prot_reconstr = protein.from_pdb_string(pdb_string_reconstr)
+
+    np.testing.assert_array_equal(prot_reconstr.aatype, prot.aatype)
+    np.testing.assert_array_almost_equal(
+        prot_reconstr.atom_positions, prot.atom_positions)
+    np.testing.assert_array_almost_equal(
+        prot_reconstr.atom_mask, prot.atom_mask)
+    np.testing.assert_array_equal(
+        prot_reconstr.residue_index, prot.residue_index)
+    np.testing.assert_array_equal(
+        prot_reconstr.chain_index, prot.chain_index)
+    np.testing.assert_array_almost_equal(
+        prot_reconstr.b_factors, prot.b_factors)
+
+  def test_ideal_atom_mask(self):
+    with open(
+        os.path.join(absltest.get_default_test_srcdir(), TEST_DATA_DIR,
+                     '2rbg.pdb')) as f:
+      pdb_string = f.read()
+    prot = protein.from_pdb_string(pdb_string)
+    ideal_mask = protein.ideal_atom_mask(prot)
+    non_ideal_residues = set([102] + list(range(127, 286)))
+    for i, (res, atom_mask) in enumerate(
+        zip(prot.residue_index, prot.atom_mask)):
+      if res in non_ideal_residues:
+        self.assertFalse(np.all(atom_mask == ideal_mask[i]), msg=f'{res}')
+      else:
+        self.assertTrue(np.all(atom_mask == ideal_mask[i]), msg=f'{res}')
+
+  def test_too_many_chains(self):
+    num_res = protein.PDB_MAX_CHAINS + 1
+    num_atom_type = residue_constants.atom_type_num
+    with self.assertRaises(ValueError):
+      _ = protein.Protein(
+          atom_positions=np.random.random([num_res, num_atom_type, 3]),
+          aatype=np.random.randint(0, 21, [num_res]),
+          atom_mask=np.random.randint(0, 2, [num_res]).astype(np.float32),
+          residue_index=np.arange(1, num_res+1),
+          chain_index=np.arange(num_res),
+          b_factors=np.random.uniform(1, 100, [num_res]))
+
+
+if __name__ == '__main__':
+  absltest.main()
--- a/alphafold/common/residue_constants.py
+++ b/alphafold/common/residue_constants.py
--- a/alphafold/common/residue_constants_test.py
+++ b/alphafold/common/residue_constants_test.py
+# Copyright 2021 DeepMind Technologies Limited
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+#      http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+
+"""Test that residue_constants generates correct values."""
+
+from absl.testing import absltest
+from absl.testing import parameterized
+from alphafold.common import residue_constants
+import numpy as np
+
+
+class ResidueConstantsTest(parameterized.TestCase):
+
+  @parameterized.parameters(
+      ('ALA', 0),
+      ('CYS', 1),
+      ('HIS', 2),
+      ('MET', 3),
+      ('LYS', 4),
+      ('ARG', 4),
+  )
+  def testChiAnglesAtoms(self, residue_name, chi_num):
+    chi_angles_atoms = residue_constants.chi_angles_atoms[residue_name]
+    self.assertLen(chi_angles_atoms, chi_num)
+    for chi_angle_atoms in chi_angles_atoms:
+      self.assertLen(chi_angle_atoms, 4)
+
+  def testChiGroupsForAtom(self):
+    for k, chi_groups in residue_constants.chi_groups_for_atom.items():
+      res_name, atom_name = k
+      for chi_group_i, atom_i in chi_groups:
+        self.assertEqual(
+            atom_name,
+            residue_constants.chi_angles_atoms[res_name][chi_group_i][atom_i])
+
+  @parameterized.parameters(
+      ('ALA', 5), ('ARG', 11), ('ASN', 8), ('ASP', 8), ('CYS', 6), ('GLN', 9),
+      ('GLU', 9), ('GLY', 4), ('HIS', 10), ('ILE', 8), ('LEU', 8), ('LYS', 9),
+      ('MET', 8), ('PHE', 11), ('PRO', 7), ('SER', 6), ('THR', 7), ('TRP', 14),
+      ('TYR', 12), ('VAL', 7)
+  )
+  def testResidueAtoms(self, atom_name, num_residue_atoms):
+    residue_atoms = residue_constants.residue_atoms[atom_name]
+    self.assertLen(residue_atoms, num_residue_atoms)
+
+  def testStandardAtomMask(self):
+    with self.subTest('Check shape'):
+      self.assertEqual(residue_constants.STANDARD_ATOM_MASK.shape, (21, 37,))
+
+    with self.subTest('Check values'):
+      str_to_row = lambda s: [c == '1' for c in s]  # More clear/concise.
+      np.testing.assert_array_equal(
+          residue_constants.STANDARD_ATOM_MASK,
+          np.array([
+              # NB This was defined by c+p but looks sane.
+              str_to_row('11111                                '),  # ALA
+              str_to_row('111111     1           1     11 1    '),  # ARG
+              str_to_row('111111         11                    '),  # ASP
+              str_to_row('111111          11                   '),  # ASN
+              str_to_row('11111     1                          '),  # CYS
+              str_to_row('111111     1             11          '),  # GLU
+              str_to_row('111111     1              11         '),  # GLN
+              str_to_row('111 1                                '),  # GLY
+              str_to_row('111111       11     1    1           '),  # HIS
+              str_to_row('11111 11    1                        '),  # ILE
+              str_to_row('111111      11                       '),  # LEU
+              str_to_row('111111     1       1               1 '),  # LYS
+              str_to_row('111111            11                 '),  # MET
+              str_to_row('111111      11      11          1    '),  # PHE
+              str_to_row('111111     1                         '),  # PRO
+              str_to_row('11111   1                            '),  # SER
+              str_to_row('11111  1 1                           '),  # THR
+              str_to_row('111111      11       11 1   1    11  '),  # TRP
+              str_to_row('111111      11      11         11    '),  # TYR
+              str_to_row('11111 11                             '),  # VAL
+              str_to_row('                                     '),  # UNK
+          ]))
+
+    with self.subTest('Check row totals'):
+      # Check each row has the right number of atoms.
+      for row, restype in enumerate(residue_constants.restypes):  # A, R, ...
+        long_restype = residue_constants.restype_1to3[restype]  # ALA, ARG, ...
+        atoms_names = residue_constants.residue_atoms[
+            long_restype]  # ['C', 'CA', 'CB', 'N', 'O'], ...
+        self.assertLen(atoms_names,
+                       residue_constants.STANDARD_ATOM_MASK[row, :].sum(),
+                       long_restype)
+
+  def testAtomTypes(self):
+    self.assertEqual(residue_constants.atom_type_num, 37)
+
+    self.assertEqual(residue_constants.atom_types[0], 'N')
+    self.assertEqual(residue_constants.atom_types[1], 'CA')
+    self.assertEqual(residue_constants.atom_types[2], 'C')
+    self.assertEqual(residue_constants.atom_types[3], 'CB')
+    self.assertEqual(residue_constants.atom_types[4], 'O')
+
+    self.assertEqual(residue_constants.atom_order['N'], 0)
+    self.assertEqual(residue_constants.atom_order['CA'], 1)
+    self.assertEqual(residue_constants.atom_order['C'], 2)
+    self.assertEqual(residue_constants.atom_order['CB'], 3)
+    self.assertEqual(residue_constants.atom_order['O'], 4)
+    self.assertEqual(residue_constants.atom_type_num, 37)
+
+  def testRestypes(self):
+    three_letter_restypes = [
+        residue_constants.restype_1to3[r] for r  in residue_constants.restypes]
+    for restype, exp_restype in zip(
+        three_letter_restypes, sorted(residue_constants.restype_1to3.values())):
+      self.assertEqual(restype, exp_restype)
+    self.assertEqual(residue_constants.restype_num, 20)
+
+  def testSequenceToOneHotHHBlits(self):
+    one_hot = residue_constants.sequence_to_onehot(
+        'ABCDEFGHIJKLMNOPQRSTUVWXYZ-', residue_constants.HHBLITS_AA_TO_ID)
+    exp_one_hot = np.array(
+        [[1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
+         [0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
+         [0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
+         [0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
+         [0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
+         [0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
+         [0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
+         [0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
+         [0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
+         [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0],
+         [0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
+         [0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
+         [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
+         [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
+         [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0],
+         [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0],
+         [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0],
+         [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0],
+         [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0],
+         [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0],
+         [0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
+         [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0],
+         [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0],
+         [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0],
+         [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0],
+         [0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
+         [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1]])
+    np.testing.assert_array_equal(one_hot, exp_one_hot)
+
+  def testSequenceToOneHotStandard(self):
+    one_hot = residue_constants.sequence_to_onehot(
+        'ARNDCQEGHILKMFPSTWYV', residue_constants.restype_order)
+    np.testing.assert_array_equal(one_hot, np.eye(20))
+
+  def testSequenceToOneHotUnknownMapping(self):
+    seq = 'ABCDEFGHIJKLMNOPQRSTUVWXYZ'
+    expected_out = np.zeros([26, 21])
+    for row, position in enumerate(
+        [0, 20, 4, 3, 6, 13, 7, 8, 9, 20, 11, 10, 12, 2, 20, 14, 5, 1, 15, 16,
+         20, 19, 17, 20, 18, 20]):
+      expected_out[row, position] = 1
+    aa_types = residue_constants.sequence_to_onehot(
+        sequence=seq,
+        mapping=residue_constants.restype_order_with_x,
+        map_unknown_to_x=True)
+    self.assertTrue((aa_types == expected_out).all())
+
+  @parameterized.named_parameters(
+      ('lowercase', 'aaa'),  # Insertions in A3M.
+      ('gaps', '---'),  # Gaps in A3M.
+      ('dots', '...'),  # Gaps in A3M.
+      ('metadata', '>TEST'),  # FASTA metadata line.
+  )
+  def testSequenceToOneHotUnknownMappingError(self, seq):
+    with self.assertRaises(ValueError):
+      residue_constants.sequence_to_onehot(
+          sequence=seq,
+          mapping=residue_constants.restype_order_with_x,
+          map_unknown_to_x=True)
+
+
+if __name__ == '__main__':
+  absltest.main()
--- a/alphafold/common/testdata/2rbg.pdb
+++ b/alphafold/common/testdata/2rbg.pdb
--- a/alphafold/data/__init__.py
+++ b/alphafold/data/__init__.py
+# Copyright 2021 DeepMind Technologies Limited
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+#      http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+"""Data pipeline for model features."""
--- a/alphafold/data/feature_processing.py
+++ b/alphafold/data/feature_processing.py
+# Copyright 2021 DeepMind Technologies Limited
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+#      http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+
+"""Feature processing logic for multimer data pipeline."""
+
+from typing import Iterable, MutableMapping, List
+
+from alphafold.common import residue_constants
+from alphafold.data import msa_pairing
+from alphafold.data import pipeline
+import numpy as np
+
+REQUIRED_FEATURES = frozenset({
+    'aatype', 'all_atom_mask', 'all_atom_positions', 'all_chains_entity_ids',
+    'all_crops_all_chains_mask', 'all_crops_all_chains_positions',
+    'all_crops_all_chains_residue_ids', 'assembly_num_chains', 'asym_id',
+    'bert_mask', 'cluster_bias_mask', 'deletion_matrix', 'deletion_mean',
+    'entity_id', 'entity_mask', 'mem_peak', 'msa', 'msa_mask', 'num_alignments',
+    'num_templates', 'queue_size', 'residue_index', 'resolution',
+    'seq_length', 'seq_mask', 'sym_id', 'template_aatype',
+    'template_all_atom_mask', 'template_all_atom_positions'
+})
+
+MAX_TEMPLATES = 4
+MSA_CROP_SIZE = 2048
+
+
+def _is_homomer_or_monomer(chains: Iterable[pipeline.FeatureDict]) -> bool:
+  """Checks if a list of chains represents a homomer/monomer example."""
+  # Note that an entity_id of 0 indicates padding.
+  num_unique_chains = len(np.unique(np.concatenate(
+      [np.unique(chain['entity_id'][chain['entity_id'] > 0]) for
+       chain in chains])))
+  return num_unique_chains == 1
+
+
+def pair_and_merge(
+    all_chain_features: MutableMapping[str, pipeline.FeatureDict]
+    ) -> pipeline.FeatureDict:
+  """Runs processing on features to augment, pair and merge.
+
+  Args:
+    all_chain_features: A MutableMap of dictionaries of features for each chain.
+
+  Returns:
+    A dictionary of features.
+  """
+
+  process_unmerged_features(all_chain_features)
+
+  np_chains_list = list(all_chain_features.values())
+
+  pair_msa_sequences = not _is_homomer_or_monomer(np_chains_list)
+
+  if pair_msa_sequences:
+    np_chains_list = msa_pairing.create_paired_features(
+        chains=np_chains_list)
+    np_chains_list = msa_pairing.deduplicate_unpaired_sequences(np_chains_list)
+  np_chains_list = crop_chains(
+      np_chains_list,
+      msa_crop_size=MSA_CROP_SIZE,
+      pair_msa_sequences=pair_msa_sequences,
+      max_templates=MAX_TEMPLATES)
+  np_example = msa_pairing.merge_chain_features(
+      np_chains_list=np_chains_list, pair_msa_sequences=pair_msa_sequences,
+      max_templates=MAX_TEMPLATES)
+  np_example = process_final(np_example)
+  return np_example
+
+
+def crop_chains(
+    chains_list: List[pipeline.FeatureDict],
+    msa_crop_size: int,
+    pair_msa_sequences: bool,
+    max_templates: int) -> List[pipeline.FeatureDict]:
+  """Crops the MSAs for a set of chains.
+
+  Args:
+    chains_list: A list of chains to be cropped.
+    msa_crop_size: The total number of sequences to crop from the MSA.
+    pair_msa_sequences: Whether we are operating in sequence-pairing mode.
+    max_templates: The maximum templates to use per chain.
+
+  Returns:
+    The chains cropped.
+  """
+
+  # Apply the cropping.
+  cropped_chains = []
+  for chain in chains_list:
+    cropped_chain = _crop_single_chain(
+        chain,
+        msa_crop_size=msa_crop_size,
+        pair_msa_sequences=pair_msa_sequences,
+        max_templates=max_templates)
+    cropped_chains.append(cropped_chain)
+
+  return cropped_chains
+
+
+def _crop_single_chain(chain: pipeline.FeatureDict,
+                       msa_crop_size: int,
+                       pair_msa_sequences: bool,
+                       max_templates: int) -> pipeline.FeatureDict:
+  """Crops msa sequences to `msa_crop_size`."""
+  msa_size = chain['num_alignments']
+
+  if pair_msa_sequences:
+    msa_size_all_seq = chain['num_alignments_all_seq']
+    msa_crop_size_all_seq = np.minimum(msa_size_all_seq, msa_crop_size // 2)
+
+    # We reduce the number of un-paired sequences, by the number of times a
+    # sequence from this chain's MSA is included in the paired MSA.  This keeps
+    # the MSA size for each chain roughly constant.
+    msa_all_seq = chain['msa_all_seq'][:msa_crop_size_all_seq, :]
+    num_non_gapped_pairs = np.sum(
+        np.any(msa_all_seq != msa_pairing.MSA_GAP_IDX, axis=1))
+    num_non_gapped_pairs = np.minimum(num_non_gapped_pairs,
+                                      msa_crop_size_all_seq)
+
+    # Restrict the unpaired crop size so that paired+unpaired sequences do not
+    # exceed msa_seqs_per_chain for each chain.
+    max_msa_crop_size = np.maximum(msa_crop_size - num_non_gapped_pairs, 0)
+    msa_crop_size = np.minimum(msa_size, max_msa_crop_size)
+  else:
+    msa_crop_size = np.minimum(msa_size, msa_crop_size)
+
+  include_templates = 'template_aatype' in chain and max_templates
+  if include_templates:
+    num_templates = chain['template_aatype'].shape[0]
+    templates_crop_size = np.minimum(num_templates, max_templates)
+
+  for k in chain:
+    k_split = k.split('_all_seq')[0]
+    if k_split in msa_pairing.TEMPLATE_FEATURES:
+      chain[k] = chain[k][:templates_crop_size, :]
+    elif k_split in msa_pairing.MSA_FEATURES:
+      if '_all_seq' in k and pair_msa_sequences:
+        chain[k] = chain[k][:msa_crop_size_all_seq, :]
+      else:
+        chain[k] = chain[k][:msa_crop_size, :]
+
+  chain['num_alignments'] = np.asarray(msa_crop_size, dtype=np.int32)
+  if include_templates:
+    chain['num_templates'] = np.asarray(templates_crop_size, dtype=np.int32)
+  if pair_msa_sequences:
+    chain['num_alignments_all_seq'] = np.asarray(
+        msa_crop_size_all_seq, dtype=np.int32)
+  return chain
+
+
+def process_final(np_example: pipeline.FeatureDict) -> pipeline.FeatureDict:
+  """Final processing steps in data pipeline, after merging and pairing."""
+  np_example = _correct_msa_restypes(np_example)
+  np_example = _make_seq_mask(np_example)
+  np_example = _make_msa_mask(np_example)
+  np_example = _filter_features(np_example)
+  return np_example
+
+
+def _correct_msa_restypes(np_example):
+  """Correct MSA restype to have the same order as residue_constants."""
+  new_order_list = residue_constants.MAP_HHBLITS_AATYPE_TO_OUR_AATYPE
+  np_example['msa'] = np.take(new_order_list, np_example['msa'], axis=0)
+  np_example['msa'] = np_example['msa'].astype(np.int32)
+  return np_example
+
+
+def _make_seq_mask(np_example):
+  np_example['seq_mask'] = (np_example['entity_id'] > 0).astype(np.float32)
+  return np_example
+
+
+def _make_msa_mask(np_example):
+  """Mask features are all ones, but will later be zero-padded."""
+
+  np_example['msa_mask'] = np.ones_like(np_example['msa'], dtype=np.float32)
+
+  seq_mask = (np_example['entity_id'] > 0).astype(np.float32)
+  np_example['msa_mask'] *= seq_mask[None]
+
+  return np_example
+
+
+def _filter_features(np_example: pipeline.FeatureDict) -> pipeline.FeatureDict:
+  """Filters features of example to only those requested."""
+  return {k: v for (k, v) in np_example.items() if k in REQUIRED_FEATURES}
+
+
+def process_unmerged_features(
+    all_chain_features: MutableMapping[str, pipeline.FeatureDict]):
+  """Postprocessing stage for per-chain features before merging."""
+  num_chains = len(all_chain_features)
+  for chain_features in all_chain_features.values():
+    # Convert deletion matrices to float.
+    chain_features['deletion_matrix'] = np.asarray(
+        chain_features.pop('deletion_matrix_int'), dtype=np.float32)
+    if 'deletion_matrix_int_all_seq' in chain_features:
+      chain_features['deletion_matrix_all_seq'] = np.asarray(
+          chain_features.pop('deletion_matrix_int_all_seq'), dtype=np.float32)
+
+    chain_features['deletion_mean'] = np.mean(
+        chain_features['deletion_matrix'], axis=0)
+
+    # Add all_atom_mask and dummy all_atom_positions based on aatype.
+    all_atom_mask = residue_constants.STANDARD_ATOM_MASK[
+        chain_features['aatype']]
+    chain_features['all_atom_mask'] = all_atom_mask
+    chain_features['all_atom_positions'] = np.zeros(
+        list(all_atom_mask.shape) + [3])
+
+    # Add assembly_num_chains.
+    chain_features['assembly_num_chains'] = np.asarray(num_chains)
+
+  # Add entity_mask.
+  for chain_features in all_chain_features.values():
+    chain_features['entity_mask'] = (
+        chain_features['entity_id'] != 0).astype(np.int32)
--- a/alphafold/data/mmcif_parsing.py
+++ b/alphafold/data/mmcif_parsing.py
+# Copyright 2021 DeepMind Technologies Limited
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+#      http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+
+"""Parses the mmCIF file format."""
+import collections
+import dataclasses
+import functools
+import io
+from typing import Any, Mapping, Optional, Sequence, Tuple
+
+from absl import logging
+from Bio import PDB
+from Bio.Data import SCOPData
+
+# Type aliases:
+ChainId = str
+PdbHeader = Mapping[str, Any]
+PdbStructure = PDB.Structure.Structure
+SeqRes = str
+MmCIFDict = Mapping[str, Sequence[str]]
+
+
+@dataclasses.dataclass(frozen=True)
+class Monomer:
+  id: str
+  num: int
+
+
+# Note - mmCIF format provides no guarantees on the type of author-assigned
+# sequence numbers. They need not be integers.
+@dataclasses.dataclass(frozen=True)
+class AtomSite:
+  residue_name: str
+  author_chain_id: str
+  mmcif_chain_id: str
+  author_seq_num: str
+  mmcif_seq_num: int
+  insertion_code: str
+  hetatm_atom: str
+  model_num: int
+
+
+# Used to map SEQRES index to a residue in the structure.
+@dataclasses.dataclass(frozen=True)
+class ResiduePosition:
+  chain_id: str
+  residue_number: int
+  insertion_code: str
+
+
+@dataclasses.dataclass(frozen=True)
+class ResidueAtPosition:
+  position: Optional[ResiduePosition]
+  name: str
+  is_missing: bool
+  hetflag: str
+
+
+@dataclasses.dataclass(frozen=True)
+class MmcifObject:
+  """Representation of a parsed mmCIF file.
+
+  Contains:
+    file_id: A meaningful name, e.g. a pdb_id. Should be unique amongst all
+      files being processed.
+    header: Biopython header.
+    structure: Biopython structure.
+    chain_to_seqres: Dict mapping chain_id to 1 letter amino acid sequence. E.g.
+      {'A': 'ABCDEFG'}
+    seqres_to_structure: Dict; for each chain_id contains a mapping between
+      SEQRES index and a ResidueAtPosition. e.g. {'A': {0: ResidueAtPosition,
+                                                        1: ResidueAtPosition,
+                                                        ...}}
+    raw_string: The raw string used to construct the MmcifObject.
+  """
+  file_id: str
+  header: PdbHeader
+  structure: PdbStructure
+  chain_to_seqres: Mapping[ChainId, SeqRes]
+  seqres_to_structure: Mapping[ChainId, Mapping[int, ResidueAtPosition]]
+  raw_string: Any
+
+
+@dataclasses.dataclass(frozen=True)
+class ParsingResult:
+  """Returned by the parse function.
+
+  Contains:
+    mmcif_object: A MmcifObject, may be None if no chain could be successfully
+      parsed.
+    errors: A dict mapping (file_id, chain_id) to any exception generated.
+  """
+  mmcif_object: Optional[MmcifObject]
+  errors: Mapping[Tuple[str, str], Any]
+
+
+class ParseError(Exception):
+  """An error indicating that an mmCIF file could not be parsed."""
+
+
+def mmcif_loop_to_list(prefix: str,
+                       parsed_info: MmCIFDict) -> Sequence[Mapping[str, str]]:
+  """Extracts loop associated with a prefix from mmCIF data as a list.
+
+  Reference for loop_ in mmCIF:
+    http://mmcif.wwpdb.org/docs/tutorials/mechanics/pdbx-mmcif-syntax.html
+
+  Args:
+    prefix: Prefix shared by each of the data items in the loop.
+      e.g. '_entity_poly_seq.', where the data items are _entity_poly_seq.num,
+      _entity_poly_seq.mon_id. Should include the trailing period.
+    parsed_info: A dict of parsed mmCIF data, e.g. _mmcif_dict from a Biopython
+      parser.
+
+  Returns:
+    Returns a list of dicts; each dict represents 1 entry from an mmCIF loop.
+  """
+  cols = []
+  data = []
+  for key, value in parsed_info.items():
+    if key.startswith(prefix):
+      cols.append(key)
+      data.append(value)
+
+  assert all([len(xs) == len(data[0]) for xs in data]), (
+      'mmCIF error: Not all loops are the same length: %s' % cols)
+
+  return [dict(zip(cols, xs)) for xs in zip(*data)]
+
+
+def mmcif_loop_to_dict(prefix: str,
+                       index: str,
+                       parsed_info: MmCIFDict,
+                       ) -> Mapping[str, Mapping[str, str]]:
+  """Extracts loop associated with a prefix from mmCIF data as a dictionary.
+
+  Args:
+    prefix: Prefix shared by each of the data items in the loop.
+      e.g. '_entity_poly_seq.', where the data items are _entity_poly_seq.num,
+      _entity_poly_seq.mon_id. Should include the trailing period.
+    index: Which item of loop data should serve as the key.
+    parsed_info: A dict of parsed mmCIF data, e.g. _mmcif_dict from a Biopython
+      parser.
+
+  Returns:
+    Returns a dict of dicts; each dict represents 1 entry from an mmCIF loop,
+    indexed by the index column.
+  """
+  entries = mmcif_loop_to_list(prefix, parsed_info)
+  return {entry[index]: entry for entry in entries}
+
+
+@functools.lru_cache(16, typed=False)
+def parse(*,
+          file_id: str,
+          mmcif_string: str,
+          catch_all_errors: bool = True) -> ParsingResult:
+  """Entry point, parses an mmcif_string.
+
+  Args:
+    file_id: A string identifier for this file. Should be unique within the
+      collection of files being processed.
+    mmcif_string: Contents of an mmCIF file.
+    catch_all_errors: If True, all exceptions are caught and error messages are
+      returned as part of the ParsingResult. If False exceptions will be allowed
+      to propagate.
+
+  Returns:
+    A ParsingResult.
+  """
+  errors = {}
+  try:
+    parser = PDB.MMCIFParser(QUIET=True)
+    handle = io.StringIO(mmcif_string)
+    full_structure = parser.get_structure('', handle)
+    first_model_structure = _get_first_model(full_structure)
+    # Extract the _mmcif_dict from the parser, which contains useful fields not
+    # reflected in the Biopython structure.
+    parsed_info = parser._mmcif_dict  # pylint:disable=protected-access
+
+    # Ensure all values are lists, even if singletons.
+    for key, value in parsed_info.items():
+      if not isinstance(value, list):
+        parsed_info[key] = [value]
+
+    header = _get_header(parsed_info)
+
+    # Determine the protein chains, and their start numbers according to the
+    # internal mmCIF numbering scheme (likely but not guaranteed to be 1).
+    valid_chains = _get_protein_chains(parsed_info=parsed_info)
+    if not valid_chains:
+      return ParsingResult(
+          None, {(file_id, ''): 'No protein chains found in this file.'})
+    seq_start_num = {chain_id: min([monomer.num for monomer in seq])
+                     for chain_id, seq in valid_chains.items()}
+
+    # Loop over the atoms for which we have coordinates. Populate two mappings:
+    # -mmcif_to_author_chain_id (maps internal mmCIF chain ids to chain ids used
+    # the authors / Biopython).
+    # -seq_to_structure_mappings (maps idx into sequence to ResidueAtPosition).
+    mmcif_to_author_chain_id = {}
+    seq_to_structure_mappings = {}
+    for atom in _get_atom_site_list(parsed_info):
+      if atom.model_num != '1':
+        # We only process the first model at the moment.
+        continue
+
+      mmcif_to_author_chain_id[atom.mmcif_chain_id] = atom.author_chain_id
+
+      if atom.mmcif_chain_id in valid_chains:
+        hetflag = ' '
+        if atom.hetatm_atom == 'HETATM':
+          # Water atoms are assigned a special hetflag of W in Biopython. We
+          # need to do the same, so that this hetflag can be used to fetch
+          # a residue from the Biopython structure by id.
+          if atom.residue_name in ('HOH', 'WAT'):
+            hetflag = 'W'
+          else:
+            hetflag = 'H_' + atom.residue_name
+        insertion_code = atom.insertion_code
+        if not _is_set(atom.insertion_code):
+          insertion_code = ' '
+        position = ResiduePosition(chain_id=atom.author_chain_id,
+                                   residue_number=int(atom.author_seq_num),
+                                   insertion_code=insertion_code)
+        seq_idx = int(atom.mmcif_seq_num) - seq_start_num[atom.mmcif_chain_id]
+        current = seq_to_structure_mappings.get(atom.author_chain_id, {})
+        current[seq_idx] = ResidueAtPosition(position=position,
+                                             name=atom.residue_name,
+                                             is_missing=False,
+                                             hetflag=hetflag)
+        seq_to_structure_mappings[atom.author_chain_id] = current
+
+    # Add missing residue information to seq_to_structure_mappings.
+    for chain_id, seq_info in valid_chains.items():
+      author_chain = mmcif_to_author_chain_id[chain_id]
+      current_mapping = seq_to_structure_mappings[author_chain]
+      for idx, monomer in enumerate(seq_info):
+        if idx not in current_mapping:
+          current_mapping[idx] = ResidueAtPosition(position=None,
+                                                   name=monomer.id,
+                                                   is_missing=True,
+                                                   hetflag=' ')
+
+    author_chain_to_sequence = {}
+    for chain_id, seq_info in valid_chains.items():
+      author_chain = mmcif_to_author_chain_id[chain_id]
+      seq = []
+      for monomer in seq_info:
+        code = SCOPData.protein_letters_3to1.get(monomer.id, 'X')
+        seq.append(code if len(code) == 1 else 'X')
+      seq = ''.join(seq)
+      author_chain_to_sequence[author_chain] = seq
+
+    mmcif_object = MmcifObject(
+        file_id=file_id,
+        header=header,
+        structure=first_model_structure,
+        chain_to_seqres=author_chain_to_sequence,
+        seqres_to_structure=seq_to_structure_mappings,
+        raw_string=parsed_info)
+
+    return ParsingResult(mmcif_object=mmcif_object, errors=errors)
+  except Exception as e:  # pylint:disable=broad-except
+    errors[(file_id, '')] = e
+    if not catch_all_errors:
+      raise
+    return ParsingResult(mmcif_object=None, errors=errors)
+
+
+def _get_first_model(structure: PdbStructure) -> PdbStructure:
+  """Returns the first model in a Biopython structure."""
+  return next(structure.get_models())
+
+_MIN_LENGTH_OF_CHAIN_TO_BE_COUNTED_AS_PEPTIDE = 21
+
+
+def get_release_date(parsed_info: MmCIFDict) -> str:
+  """Returns the oldest revision date."""
+  revision_dates = parsed_info['_pdbx_audit_revision_history.revision_date']
+  return min(revision_dates)
+
+
+def _get_header(parsed_info: MmCIFDict) -> PdbHeader:
+  """Returns a basic header containing method, release date and resolution."""
+  header = {}
+
+  experiments = mmcif_loop_to_list('_exptl.', parsed_info)
+  header['structure_method'] = ','.join([
+      experiment['_exptl.method'].lower() for experiment in experiments])
+
+  # Note: The release_date here corresponds to the oldest revision. We prefer to
+  # use this for dataset filtering over the deposition_date.
+  if '_pdbx_audit_revision_history.revision_date' in parsed_info:
+    header['release_date'] = get_release_date(parsed_info)
+  else:
+    logging.warning('Could not determine release_date: %s',
+                    parsed_info['_entry.id'])
+
+  header['resolution'] = 0.00
+  for res_key in ('_refine.ls_d_res_high', '_em_3d_reconstruction.resolution',
+                  '_reflns.d_resolution_high'):
+    if res_key in parsed_info:
+      try:
+        raw_resolution = parsed_info[res_key][0]
+        header['resolution'] = float(raw_resolution)
+      except ValueError:
+        logging.debug('Invalid resolution format: %s', parsed_info[res_key])
+
+  return header
+
+
+def _get_atom_site_list(parsed_info: MmCIFDict) -> Sequence[AtomSite]:
+  """Returns list of atom sites; contains data not present in the structure."""
+  return [AtomSite(*site) for site in zip(  # pylint:disable=g-complex-comprehension
+      parsed_info['_atom_site.label_comp_id'],
+      parsed_info['_atom_site.auth_asym_id'],
+      parsed_info['_atom_site.label_asym_id'],
+      parsed_info['_atom_site.auth_seq_id'],
+      parsed_info['_atom_site.label_seq_id'],
+      parsed_info['_atom_site.pdbx_PDB_ins_code'],
+      parsed_info['_atom_site.group_PDB'],
+      parsed_info['_atom_site.pdbx_PDB_model_num'],
+      )]
+
+
+def _get_protein_chains(
+    *, parsed_info: Mapping[str, Any]) -> Mapping[ChainId, Sequence[Monomer]]:
+  """Extracts polymer information for protein chains only.
+
+  Args:
+    parsed_info: _mmcif_dict produced by the Biopython parser.
+
+  Returns:
+    A dict mapping mmcif chain id to a list of Monomers.
+  """
+  # Get polymer information for each entity in the structure.
+  entity_poly_seqs = mmcif_loop_to_list('_entity_poly_seq.', parsed_info)
+
+  polymers = collections.defaultdict(list)
+  for entity_poly_seq in entity_poly_seqs:
+    polymers[entity_poly_seq['_entity_poly_seq.entity_id']].append(
+        Monomer(id=entity_poly_seq['_entity_poly_seq.mon_id'],
+                num=int(entity_poly_seq['_entity_poly_seq.num'])))
+
+  # Get chemical compositions. Will allow us to identify which of these polymers
+  # are proteins.
+  chem_comps = mmcif_loop_to_dict('_chem_comp.', '_chem_comp.id', parsed_info)
+
+  # Get chains information for each entity. Necessary so that we can return a
+  # dict keyed on chain id rather than entity.
+  struct_asyms = mmcif_loop_to_list('_struct_asym.', parsed_info)
+
+  entity_to_mmcif_chains = collections.defaultdict(list)
+  for struct_asym in struct_asyms:
+    chain_id = struct_asym['_struct_asym.id']
+    entity_id = struct_asym['_struct_asym.entity_id']
+    entity_to_mmcif_chains[entity_id].append(chain_id)
+
+  # Identify and return the valid protein chains.
+  valid_chains = {}
+  for entity_id, seq_info in polymers.items():
+    chain_ids = entity_to_mmcif_chains[entity_id]
+
+    # Reject polymers without any peptide-like components, such as DNA/RNA.
+    if any(['peptide' in chem_comps[monomer.id]['_chem_comp.type'].lower()
+            for monomer in seq_info]):
+      for chain_id in chain_ids:
+        valid_chains[chain_id] = seq_info
+  return valid_chains
+
+
+def _is_set(data: str) -> bool:
+  """Returns False if data is a special mmCIF character indicating 'unset'."""
+  return data not in ('.', '?')
--- a/alphafold/data/msa_identifiers.py
+++ b/alphafold/data/msa_identifiers.py
+# Copyright 2021 DeepMind Technologies Limited
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+#      http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+
+"""Utilities for extracting identifiers from MSA sequence descriptions."""
+
+import dataclasses
+import re
+from typing import Optional
+
+
+# Sequences coming from UniProtKB database come in the
+# `db|UniqueIdentifier|EntryName` format, e.g. `tr|A0A146SKV9|A0A146SKV9_FUNHE`
+# or `sp|P0C2L1|A3X1_LOXLA` (for TREMBL/Swiss-Prot respectively).
+_UNIPROT_PATTERN = re.compile(
+    r"""
+    ^
+    # UniProtKB/TrEMBL or UniProtKB/Swiss-Prot
+    (?:tr|sp)
+    \|
+    # A primary accession number of the UniProtKB entry.
+    (?P<AccessionIdentifier>[A-Za-z0-9]{6,10})
+    # Occasionally there is a _0 or _1 isoform suffix, which we ignore.
+    (?:_\d)?
+    \|
+    # TREMBL repeats the accession ID here. Swiss-Prot has a mnemonic
+    # protein ID code.
+    (?:[A-Za-z0-9]+)
+    _
+    # A mnemonic species identification code.
+    (?P<SpeciesIdentifier>([A-Za-z0-9]){1,5})
+    # Small BFD uses a final value after an underscore, which we ignore.
+    (?:_\d+)?
+    $
+    """,
+    re.VERBOSE)
+
+
+@dataclasses.dataclass(frozen=True)
+class Identifiers:
+  species_id: str = ''
+
+
+def _parse_sequence_identifier(msa_sequence_identifier: str) -> Identifiers:
+  """Gets species from an msa sequence identifier.
+
+  The sequence identifier has the format specified by
+  _UNIPROT_TREMBL_ENTRY_NAME_PATTERN or _UNIPROT_SWISSPROT_ENTRY_NAME_PATTERN.
+  An example of a sequence identifier: `tr|A0A146SKV9|A0A146SKV9_FUNHE`
+
+  Args:
+    msa_sequence_identifier: a sequence identifier.
+
+  Returns:
+    An `Identifiers` instance with species_id. These
+    can be empty in the case where no identifier was found.
+  """
+  matches = re.search(_UNIPROT_PATTERN, msa_sequence_identifier.strip())
+  if matches:
+    return Identifiers(
+        species_id=matches.group('SpeciesIdentifier'))
+  return Identifiers()
+
+
+def _extract_sequence_identifier(description: str) -> Optional[str]:
+  """Extracts sequence identifier from description. Returns None if no match."""
+  split_description = description.split()
+  if split_description:
+    return split_description[0].partition('/')[0]
+  else:
+    return None
+
+
+def get_identifiers(description: str) -> Identifiers:
+  """Computes extra MSA features from the description."""
+  sequence_identifier = _extract_sequence_identifier(description)
+  if sequence_identifier is None:
+    return Identifiers()
+  else:
+    return _parse_sequence_identifier(sequence_identifier)
--- a/alphafold/data/msa_pairing.py
+++ b/alphafold/data/msa_pairing.py
+# Copyright 2021 DeepMind Technologies Limited
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+#      http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+
+"""Pairing logic for multimer data pipeline."""
+
+import collections
+import functools
+import string
+from typing import Any, Dict, Iterable, List, Sequence
+
+from alphafold.common import residue_constants
+from alphafold.data import pipeline
+import numpy as np
+import pandas as pd
+import scipy.linalg
+
+MSA_GAP_IDX = residue_constants.restypes_with_x_and_gap.index('-')
+SEQUENCE_GAP_CUTOFF = 0.5
+SEQUENCE_SIMILARITY_CUTOFF = 0.9
+
+MSA_PAD_VALUES = {'msa_all_seq': MSA_GAP_IDX,
+                  'msa_mask_all_seq': 1,
+                  'deletion_matrix_all_seq': 0,
+                  'deletion_matrix_int_all_seq': 0,
+                  'msa': MSA_GAP_IDX,
+                  'msa_mask': 1,
+                  'deletion_matrix': 0,
+                  'deletion_matrix_int': 0}
+
+MSA_FEATURES = ('msa', 'msa_mask', 'deletion_matrix', 'deletion_matrix_int')
+SEQ_FEATURES = ('residue_index', 'aatype', 'all_atom_positions',
+                'all_atom_mask', 'seq_mask', 'between_segment_residues',
+                'has_alt_locations', 'has_hetatoms', 'asym_id', 'entity_id',
+                'sym_id', 'entity_mask', 'deletion_mean',
+                'prediction_atom_mask',
+                'literature_positions', 'atom_indices_to_group_indices',
+                'rigid_group_default_frame')
+TEMPLATE_FEATURES = ('template_aatype', 'template_all_atom_positions',
+                     'template_all_atom_mask')
+CHAIN_FEATURES = ('num_alignments', 'seq_length')
+
+
+def create_paired_features(
+    chains: Iterable[pipeline.FeatureDict]) ->  List[pipeline.FeatureDict]:
+  """Returns the original chains with paired NUM_SEQ features.
+
+  Args:
+    chains:  A list of feature dictionaries for each chain.
+
+  Returns:
+    A list of feature dictionaries with sequence features including only
+    rows to be paired.
+  """
+  chains = list(chains)
+  chain_keys = chains[0].keys()
+
+  if len(chains) < 2:
+    return chains
+  else:
+    updated_chains = []
+    paired_chains_to_paired_row_indices = pair_sequences(chains)
+    paired_rows = reorder_paired_rows(
+        paired_chains_to_paired_row_indices)
+
+    for chain_num, chain in enumerate(chains):
+      new_chain = {k: v for k, v in chain.items() if '_all_seq' not in k}
+      for feature_name in chain_keys:
+        if feature_name.endswith('_all_seq'):
+          feats_padded = pad_features(chain[feature_name], feature_name)
+          new_chain[feature_name] = feats_padded[paired_rows[:, chain_num]]
+      new_chain['num_alignments_all_seq'] = np.asarray(
+          len(paired_rows[:, chain_num]))
+      updated_chains.append(new_chain)
+    return updated_chains
+
+
+def pad_features(feature: np.ndarray, feature_name: str) -> np.ndarray:
+  """Add a 'padding' row at the end of the features list.
+
+  The padding row will be selected as a 'paired' row in the case of partial
+  alignment - for the chain that doesn't have paired alignment.
+
+  Args:
+    feature: The feature to be padded.
+    feature_name: The name of the feature to be padded.
+
+  Returns:
+    The feature with an additional padding row.
+  """
+  assert feature.dtype != np.dtype(np.string_)
+  if feature_name in ('msa_all_seq', 'msa_mask_all_seq',
+                      'deletion_matrix_all_seq', 'deletion_matrix_int_all_seq'):
+    num_res = feature.shape[1]
+    padding = MSA_PAD_VALUES[feature_name] * np.ones([1, num_res],
+                                                     feature.dtype)
+  elif feature_name == 'msa_species_identifiers_all_seq':
+    padding = [b'']
+  else:
+    return feature
+  feats_padded = np.concatenate([feature, padding], axis=0)
+  return feats_padded
+
+
+def _make_msa_df(chain_features: pipeline.FeatureDict) -> pd.DataFrame:
+  """Makes dataframe with msa features needed for msa pairing."""
+  chain_msa = chain_features['msa_all_seq']
+  query_seq = chain_msa[0]
+  per_seq_similarity = np.sum(
+      query_seq[None] == chain_msa, axis=-1) / float(len(query_seq))
+  per_seq_gap = np.sum(chain_msa == 21, axis=-1) / float(len(query_seq))
+  msa_df = pd.DataFrame({
+      'msa_species_identifiers':
+          chain_features['msa_species_identifiers_all_seq'],
+      'msa_row':
+          np.arange(len(
+              chain_features['msa_species_identifiers_all_seq'])),
+      'msa_similarity': per_seq_similarity,
+      'gap': per_seq_gap
+  })
+  return msa_df
+
+
+def _create_species_dict(msa_df: pd.DataFrame) -> Dict[bytes, pd.DataFrame]:
+  """Creates mapping from species to msa dataframe of that species."""
+  species_lookup = {}
+  for species, species_df in msa_df.groupby('msa_species_identifiers'):
+    species_lookup[species] = species_df
+  return species_lookup
+
+
+def _match_rows_by_sequence_similarity(this_species_msa_dfs: List[pd.DataFrame]
+                                       ) -> List[List[int]]:
+  """Finds MSA sequence pairings across chains based on sequence similarity.
+
+  Each chain's MSA sequences are first sorted by their sequence similarity to
+  their respective target sequence. The sequences are then paired, starting
+  from the sequences most similar to their target sequence.
+
+  Args:
+    this_species_msa_dfs: a list of dataframes containing MSA features for
+      sequences for a specific species.
+
+  Returns:
+   A list of lists, each containing M indices corresponding to paired MSA rows,
+   where M is the number of chains.
+  """
+  all_paired_msa_rows = []
+
+  num_seqs = [len(species_df) for species_df in this_species_msa_dfs
+              if species_df is not None]
+  take_num_seqs = np.min(num_seqs)
+
+  sort_by_similarity = (
+      lambda x: x.sort_values('msa_similarity', axis=0, ascending=False))
+
+  for species_df in this_species_msa_dfs:
+    if species_df is not None:
+      species_df_sorted = sort_by_similarity(species_df)
+      msa_rows = species_df_sorted.msa_row.iloc[:take_num_seqs].values
+    else:
+      msa_rows = [-1] * take_num_seqs  # take the last 'padding' row
+    all_paired_msa_rows.append(msa_rows)
+  all_paired_msa_rows = list(np.array(all_paired_msa_rows).transpose())
+  return all_paired_msa_rows
+
+
+def pair_sequences(examples: List[pipeline.FeatureDict]
+                   ) -> Dict[int, np.ndarray]:
+  """Returns indices for paired MSA sequences across chains."""
+
+  num_examples = len(examples)
+
+  all_chain_species_dict = []
+  common_species = set()
+  for chain_features in examples:
+    msa_df = _make_msa_df(chain_features)
+    species_dict = _create_species_dict(msa_df)
+    all_chain_species_dict.append(species_dict)
+    common_species.update(set(species_dict))
+
+  common_species = sorted(common_species)
+  common_species.remove(b'')  # Remove target sequence species.
+
+  all_paired_msa_rows = [np.zeros(len(examples), int)]
+  all_paired_msa_rows_dict = {k: [] for k in range(num_examples)}
+  all_paired_msa_rows_dict[num_examples] = [np.zeros(len(examples), int)]
+
+  for species in common_species:
+    if not species:
+      continue
+    this_species_msa_dfs = []
+    species_dfs_present = 0
+    for species_dict in all_chain_species_dict:
+      if species in species_dict:
+        this_species_msa_dfs.append(species_dict[species])
+        species_dfs_present += 1
+      else:
+        this_species_msa_dfs.append(None)
+
+    # Skip species that are present in only one chain.
+    if species_dfs_present <= 1:
+      continue
+
+    if np.any(
+        np.array([len(species_df) for species_df in
+                  this_species_msa_dfs if
+                  isinstance(species_df, pd.DataFrame)]) > 600):
+      continue
+
+    paired_msa_rows = _match_rows_by_sequence_similarity(this_species_msa_dfs)
+    all_paired_msa_rows.extend(paired_msa_rows)
+    all_paired_msa_rows_dict[species_dfs_present].extend(paired_msa_rows)
+  all_paired_msa_rows_dict = {
+      num_examples: np.array(paired_msa_rows) for
+      num_examples, paired_msa_rows in all_paired_msa_rows_dict.items()
+  }
+  return all_paired_msa_rows_dict
+
+
+def reorder_paired_rows(all_paired_msa_rows_dict: Dict[int, np.ndarray]
+                        ) -> np.ndarray:
+  """Creates a list of indices of paired MSA rows across chains.
+
+  Args:
+    all_paired_msa_rows_dict: a mapping from the number of paired chains to the
+      paired indices.
+
+  Returns:
+    a list of lists, each containing indices of paired MSA rows across chains.
+    The paired-index lists are ordered by:
+      1) the number of chains in the paired alignment, i.e, all-chain pairings
+         will come first.
+      2) e-values
+  """
+  all_paired_msa_rows = []
+
+  for num_pairings in sorted(all_paired_msa_rows_dict, reverse=True):
+    paired_rows = all_paired_msa_rows_dict[num_pairings]
+    paired_rows_product = abs(np.array([np.prod(rows) for rows in paired_rows]))
+    paired_rows_sort_index = np.argsort(paired_rows_product)
+    all_paired_msa_rows.extend(paired_rows[paired_rows_sort_index])
+
+  return np.array(all_paired_msa_rows)
+
+
+def block_diag(*arrs: np.ndarray, pad_value: float = 0.0) -> np.ndarray:
+  """Like scipy.linalg.block_diag but with an optional padding value."""
+  ones_arrs = [np.ones_like(x) for x in arrs]
+  off_diag_mask = 1.0 - scipy.linalg.block_diag(*ones_arrs)
+  diag = scipy.linalg.block_diag(*arrs)
+  diag += (off_diag_mask * pad_value).astype(diag.dtype)
+  return diag
+
+
+def _correct_post_merged_feats(
+    np_example: pipeline.FeatureDict,
+    np_chains_list: Sequence[pipeline.FeatureDict],
+    pair_msa_sequences: bool) -> pipeline.FeatureDict:
+  """Adds features that need to be computed/recomputed post merging."""
+
+  np_example['seq_length'] = np.asarray(np_example['aatype'].shape[0],
+                                        dtype=np.int32)
+  np_example['num_alignments'] = np.asarray(np_example['msa'].shape[0],
+                                            dtype=np.int32)
+
+  if not pair_msa_sequences:
+    # Generate a bias that is 1 for the first row of every block in the
+    # block diagonal MSA - i.e. make sure the cluster stack always includes
+    # the query sequences for each chain (since the first row is the query
+    # sequence).
+    cluster_bias_masks = []
+    for chain in np_chains_list:
+      mask = np.zeros(chain['msa'].shape[0])
+      mask[0] = 1
+      cluster_bias_masks.append(mask)
+    np_example['cluster_bias_mask'] = np.concatenate(cluster_bias_masks)
+
+    # Initialize Bert mask with masked out off diagonals.
+    msa_masks = [np.ones(x['msa'].shape, dtype=np.float32)
+                 for x in np_chains_list]
+
+    np_example['bert_mask'] = block_diag(
+        *msa_masks, pad_value=0)
+  else:
+    np_example['cluster_bias_mask'] = np.zeros(np_example['msa'].shape[0])
+    np_example['cluster_bias_mask'][0] = 1
+
+    # Initialize Bert mask with masked out off diagonals.
+    msa_masks = [np.ones(x['msa'].shape, dtype=np.float32) for
+                 x in np_chains_list]
+    msa_masks_all_seq = [np.ones(x['msa_all_seq'].shape, dtype=np.float32) for
+                         x in np_chains_list]
+
+    msa_mask_block_diag = block_diag(
+        *msa_masks, pad_value=0)
+    msa_mask_all_seq = np.concatenate(msa_masks_all_seq, axis=1)
+    np_example['bert_mask'] = np.concatenate(
+        [msa_mask_all_seq, msa_mask_block_diag], axis=0)
+  return np_example
+
+
+def _pad_templates(chains: Sequence[pipeline.FeatureDict],
+                   max_templates: int) -> Sequence[pipeline.FeatureDict]:
+  """For each chain pad the number of templates to a fixed size.
+
+  Args:
+    chains: A list of protein chains.
+    max_templates: Each chain will be padded to have this many templates.
+
+  Returns:
+    The list of chains, updated to have template features padded to
+    max_templates.
+  """
+  for chain in chains:
+    for k, v in chain.items():
+      if k in TEMPLATE_FEATURES:
+        padding = np.zeros_like(v.shape)
+        padding[0] = max_templates - v.shape[0]
+        padding = [(0, p) for p in padding]
+        chain[k] = np.pad(v, padding, mode='constant')
+  return chains
+
+
+def _merge_features_from_multiple_chains(
+    chains: Sequence[pipeline.FeatureDict],
+    pair_msa_sequences: bool) -> pipeline.FeatureDict:
+  """Merge features from multiple chains.
+
+  Args:
+    chains: A list of feature dictionaries that we want to merge.
+    pair_msa_sequences: Whether to concatenate MSA features along the
+      num_res dimension (if True), or to block diagonalize them (if False).
+
+  Returns:
+    A feature dictionary for the merged example.
+  """
+  merged_example = {}
+  for feature_name in chains[0]:
+    feats = [x[feature_name] for x in chains]
+    feature_name_split = feature_name.split('_all_seq')[0]
+    if feature_name_split in MSA_FEATURES:
+      if pair_msa_sequences or '_all_seq' in feature_name:
+        merged_example[feature_name] = np.concatenate(feats, axis=1)
+      else:
+        merged_example[feature_name] = block_diag(
+            *feats, pad_value=MSA_PAD_VALUES[feature_name])
+    elif feature_name_split in SEQ_FEATURES:
+      merged_example[feature_name] = np.concatenate(feats, axis=0)
+    elif feature_name_split in TEMPLATE_FEATURES:
+      merged_example[feature_name] = np.concatenate(feats, axis=1)
+    elif feature_name_split in CHAIN_FEATURES:
+      merged_example[feature_name] = np.sum(x for x in feats).astype(np.int32)
+    else:
+      merged_example[feature_name] = feats[0]
+  return merged_example
+
+
+def _merge_homomers_dense_msa(
+    chains: Iterable[pipeline.FeatureDict]) -> Sequence[pipeline.FeatureDict]:
+  """Merge all identical chains, making the resulting MSA dense.
+
+  Args:
+    chains: An iterable of features for each chain.
+
+  Returns:
+    A list of feature dictionaries.  All features with the same entity_id
+    will be merged - MSA features will be concatenated along the num_res
+    dimension - making them dense.
+  """
+  entity_chains = collections.defaultdict(list)
+  for chain in chains:
+    entity_id = chain['entity_id'][0]
+    entity_chains[entity_id].append(chain)
+
+  grouped_chains = []
+  for entity_id in sorted(entity_chains):
+    chains = entity_chains[entity_id]
+    grouped_chains.append(chains)
+  chains = [
+      _merge_features_from_multiple_chains(chains, pair_msa_sequences=True)
+      for chains in grouped_chains]
+  return chains
+
+
+def _concatenate_paired_and_unpaired_features(
+    example: pipeline.FeatureDict) -> pipeline.FeatureDict:
+  """Merges paired and block-diagonalised features."""
+  features = MSA_FEATURES
+  for feature_name in features:
+    if feature_name in example:
+      feat = example[feature_name]
+      feat_all_seq = example[feature_name + '_all_seq']
+      merged_feat = np.concatenate([feat_all_seq, feat], axis=0)
+      example[feature_name] = merged_feat
+  example['num_alignments'] = np.array(example['msa'].shape[0],
+                                       dtype=np.int32)
+  return example
+
+
+def merge_chain_features(np_chains_list: List[pipeline.FeatureDict],
+                         pair_msa_sequences: bool,
+                         max_templates: int) -> pipeline.FeatureDict:
+  """Merges features for multiple chains to single FeatureDict.
+
+  Args:
+    np_chains_list: List of FeatureDicts for each chain.
+    pair_msa_sequences: Whether to merge paired MSAs.
+    max_templates: The maximum number of templates to include.
+
+  Returns:
+    Single FeatureDict for entire complex.
+  """
+  np_chains_list = _pad_templates(
+      np_chains_list, max_templates=max_templates)
+  np_chains_list = _merge_homomers_dense_msa(np_chains_list)
+  # Unpaired MSA features will be always block-diagonalised; paired MSA
+  # features will be concatenated.
+  np_example = _merge_features_from_multiple_chains(
+      np_chains_list, pair_msa_sequences=False)
+  if pair_msa_sequences:
+    np_example = _concatenate_paired_and_unpaired_features(np_example)
+  np_example = _correct_post_merged_feats(
+      np_example=np_example,
+      np_chains_list=np_chains_list,
+      pair_msa_sequences=pair_msa_sequences)
+
+  return np_example
+
+
+def deduplicate_unpaired_sequences(
+    np_chains: List[pipeline.FeatureDict]) -> List[pipeline.FeatureDict]:
+  """Removes unpaired sequences which duplicate a paired sequence."""
+
+  feature_names = np_chains[0].keys()
+  msa_features = MSA_FEATURES
+
+  for chain in np_chains:
+    # Convert the msa_all_seq numpy array to a tuple for hashing.
+    sequence_set = set(tuple(s) for s in chain['msa_all_seq'])
+    keep_rows = []
+    # Go through unpaired MSA seqs and remove any rows that correspond to the
+    # sequences that are already present in the paired MSA.
+    for row_num, seq in enumerate(chain['msa']):
+      if tuple(seq) not in sequence_set:
+        keep_rows.append(row_num)
+    for feature_name in feature_names:
+      if feature_name in msa_features:
+        chain[feature_name] = chain[feature_name][keep_rows]
+    chain['num_alignments'] = np.array(chain['msa'].shape[0], dtype=np.int32)
+  return np_chains
--- a/alphafold/data/parsers.py
+++ b/alphafold/data/parsers.py
--- a/alphafold/data/pipeline.py
+++ b/alphafold/data/pipeline.py