Commit eb93322b authored by mashun1

dtk24.04.1

.dockerignore
docker/Dockerfile
*.egg*
tryme.ipynb
build/
dist/
test/
temp_output/
temp_fasta/
*pycache*
# How to Contribute
We welcome small patches related to bug fixes and documentation, but we do not
plan to make any major changes to this repository.
## Contributor License Agreement
Contributions to this project must be accompanied by a Contributor License
Agreement. You (or your employer) retain the copyright to your contribution;
this simply gives us permission to use and redistribute your contributions as
part of the project. Head over to <https://cla.developers.google.com/> to see
your current agreements on file or to sign a new one.
You generally only need to submit a CLA once, so if you've already submitted one
(even if it was for a different project), you probably don't need to do it
again.
## Code reviews
All submissions, including submissions by project members, require review. We
use GitHub pull requests for this purpose. Consult
[GitHub Help](https://help.github.com/articles/about-pull-requests/) for more
information on using pull requests.
FROM image.sourcefind.cn:5000/dcu/admin/base/jax:0.4.23-ubuntu20.04-dtk24.04.1-py3.10
# RUN apt update
# WORKDIR /app
# WORKDIR /app/softwares
# RUN git clone https://github.com/soedinglab/hh-suite.git
# RUN mkdir -p hh-suite/build && cd hh-suite/build && cmake -DCMAKE_INSTALL_PREFIX=. .. && make -j 4 && make install
# ENV PATH=/app/softwares/hh-suite/build/bin:/app/softwares/hh-suite/build/scripts:$PATH
# WORKDIR /app/softwares
# RUN wget https://github.com/TimoLassmann/kalign/archive/refs/tags/v3.4.0.zip && unzip v3.4.0.zip && cd kalign-3.4.0 && mkdir build && cd build && cmake .. && make && make install
# WORKDIR /app/softwares
# RUN sudo apt install doxygen -y
# RUN wget https://github.com/openmm/openmm/archive/refs/tags/8.0.0.zip && unzip 8.0.0.zip && cd openmm-8.0.0 && mkdir build && cd build && cmake .. && make && sudo make install && sudo make PythonInstall
# WORKDIR /app/softwares
# RUN wget https://github.com/openmm/pdbfixer/archive/refs/tags/1.9.zip && unzip 1.9.zip && cd pdbfixer-1.9 && python setup.py install
# RUN sudo apt install hmmer -y
# WORKDIR /app
# COPY . /app/alphafold2
# RUN ls
# RUN pip install --no-cache-dir -r /app/alphafold2/requirements_dcu.txt -i https://mirrors.ustc.edu.cn/pypi/web/simple
# RUN pip install dm-haiku==0.0.11 flax==0.7.1 jmp==0.0.2 tabulate==0.8.9 --no-deps jax -i https://mirrors.ustc.edu.cn/pypi/web/simple
# RUN pip install orbax==0.1.6 orbax-checkpoint==0.1.6 optax==0.2.2 -i https://mirrors.ustc.edu.cn/pypi/web/simple
# WORKDIR /app/alphafold2
# RUN python setup.py install
Apache License
Version 2.0, January 2004
http://www.apache.org/licenses/
TERMS AND CONDITIONS FOR USE, REPRODUCTION, AND DISTRIBUTION
1. Definitions.
"License" shall mean the terms and conditions for use, reproduction,
and distribution as defined by Sections 1 through 9 of this document.
"Licensor" shall mean the copyright owner or entity authorized by
the copyright owner that is granting the License.
"Legal Entity" shall mean the union of the acting entity and all
other entities that control, are controlled by, or are under common
control with that entity. For the purposes of this definition,
"control" means (i) the power, direct or indirect, to cause the
direction or management of such entity, whether by contract or
otherwise, or (ii) ownership of fifty percent (50%) or more of the
outstanding shares, or (iii) beneficial ownership of such entity.
"You" (or "Your") shall mean an individual or Legal Entity
exercising permissions granted by this License.
"Source" form shall mean the preferred form for making modifications,
including but not limited to software source code, documentation
source, and configuration files.
"Object" form shall mean any form resulting from mechanical
transformation or translation of a Source form, including but
not limited to compiled object code, generated documentation,
and conversions to other media types.
"Work" shall mean the work of authorship, whether in Source or
Object form, made available under the License, as indicated by a
copyright notice that is included in or attached to the work
(an example is provided in the Appendix below).
"Derivative Works" shall mean any work, whether in Source or Object
form, that is based on (or derived from) the Work and for which the
editorial revisions, annotations, elaborations, or other modifications
represent, as a whole, an original work of authorship. For the purposes
of this License, Derivative Works shall not include works that remain
separable from, or merely link (or bind by name) to the interfaces of,
the Work and Derivative Works thereof.
"Contribution" shall mean any work of authorship, including
the original version of the Work and any modifications or additions
to that Work or Derivative Works thereof, that is intentionally
submitted to Licensor for inclusion in the Work by the copyright owner
or by an individual or Legal Entity authorized to submit on behalf of
the copyright owner. For the purposes of this definition, "submitted"
means any form of electronic, verbal, or written communication sent
to the Licensor or its representatives, including but not limited to
communication on electronic mailing lists, source code control systems,
and issue tracking systems that are managed by, or on behalf of, the
Licensor for the purpose of discussing and improving the Work, but
excluding communication that is conspicuously marked or otherwise
designated in writing by the copyright owner as "Not a Contribution."
"Contributor" shall mean Licensor and any individual or Legal Entity
on behalf of whom a Contribution has been received by Licensor and
subsequently incorporated within the Work.
2. Grant of Copyright License. Subject to the terms and conditions of
this License, each Contributor hereby grants to You a perpetual,
worldwide, non-exclusive, no-charge, royalty-free, irrevocable
copyright license to reproduce, prepare Derivative Works of,
publicly display, publicly perform, sublicense, and distribute the
Work and such Derivative Works in Source or Object form.
3. Grant of Patent License. Subject to the terms and conditions of
this License, each Contributor hereby grants to You a perpetual,
worldwide, non-exclusive, no-charge, royalty-free, irrevocable
(except as stated in this section) patent license to make, have made,
use, offer to sell, sell, import, and otherwise transfer the Work,
where such license applies only to those patent claims licensable
by such Contributor that are necessarily infringed by their
Contribution(s) alone or by combination of their Contribution(s)
with the Work to which such Contribution(s) was submitted. If You
institute patent litigation against any entity (including a
cross-claim or counterclaim in a lawsuit) alleging that the Work
or a Contribution incorporated within the Work constitutes direct
or contributory patent infringement, then any patent licenses
granted to You under this License for that Work shall terminate
as of the date such litigation is filed.
4. Redistribution. You may reproduce and distribute copies of the
Work or Derivative Works thereof in any medium, with or without
modifications, and in Source or Object form, provided that You
meet the following conditions:
(a) You must give any other recipients of the Work or
Derivative Works a copy of this License; and
(b) You must cause any modified files to carry prominent notices
stating that You changed the files; and
(c) You must retain, in the Source form of any Derivative Works
that You distribute, all copyright, patent, trademark, and
attribution notices from the Source form of the Work,
excluding those notices that do not pertain to any part of
the Derivative Works; and
(d) If the Work includes a "NOTICE" text file as part of its
distribution, then any Derivative Works that You distribute must
include a readable copy of the attribution notices contained
within such NOTICE file, excluding those notices that do not
pertain to any part of the Derivative Works, in at least one
of the following places: within a NOTICE text file distributed
as part of the Derivative Works; within the Source form or
documentation, if provided along with the Derivative Works; or,
within a display generated by the Derivative Works, if and
wherever such third-party notices normally appear. The contents
of the NOTICE file are for informational purposes only and
do not modify the License. You may add Your own attribution
notices within Derivative Works that You distribute, alongside
or as an addendum to the NOTICE text from the Work, provided
that such additional attribution notices cannot be construed
as modifying the License.
You may add Your own copyright statement to Your modifications and
may provide additional or different license terms and conditions
for use, reproduction, or distribution of Your modifications, or
for any such Derivative Works as a whole, provided Your use,
reproduction, and distribution of the Work otherwise complies with
the conditions stated in this License.
5. Submission of Contributions. Unless You explicitly state otherwise,
any Contribution intentionally submitted for inclusion in the Work
by You to the Licensor shall be under the terms and conditions of
this License, without any additional terms or conditions.
Notwithstanding the above, nothing herein shall supersede or modify
the terms of any separate license agreement you may have executed
with Licensor regarding such Contributions.
6. Trademarks. This License does not grant permission to use the trade
names, trademarks, service marks, or product names of the Licensor,
except as required for reasonable and customary use in describing the
origin of the Work and reproducing the content of the NOTICE file.
7. Disclaimer of Warranty. Unless required by applicable law or
agreed to in writing, Licensor provides the Work (and each
Contributor provides its Contributions) on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or
implied, including, without limitation, any warranties or conditions
of TITLE, NON-INFRINGEMENT, MERCHANTABILITY, or FITNESS FOR A
PARTICULAR PURPOSE. You are solely responsible for determining the
appropriateness of using or redistributing the Work and assume any
risks associated with Your exercise of permissions under this License.
8. Limitation of Liability. In no event and under no legal theory,
whether in tort (including negligence), contract, or otherwise,
unless required by applicable law (such as deliberate and grossly
negligent acts) or agreed to in writing, shall any Contributor be
liable to You for damages, including any direct, indirect, special,
incidental, or consequential damages of any character arising as a
result of this License or out of the use or inability to use the
Work (including but not limited to damages for loss of goodwill,
work stoppage, computer failure or malfunction, or any and all
other commercial damages or losses), even if such Contributor
has been advised of the possibility of such damages.
9. Accepting Warranty or Additional Liability. While redistributing
the Work or Derivative Works thereof, You may choose to offer,
and charge a fee for, acceptance of support, warranty, indemnity,
or other liability obligations and/or rights consistent with this
License. However, in accepting such obligations, You may act only
on Your own behalf and on Your sole responsibility, not on behalf
of any other Contributor, and only if You agree to indemnify,
defend, and hold each Contributor harmless for any liability
incurred by, or claims asserted against, such Contributor by reason
of your accepting any such warranty or additional liability.
END OF TERMS AND CONDITIONS
APPENDIX: How to apply the Apache License to your work.
To apply the Apache License to your work, attach the following
boilerplate notice, with the fields enclosed by brackets "[]"
replaced with your own identifying information. (Don't include
the brackets!) The text should be enclosed in the appropriate
comment syntax for the file format. We also recommend that a
file or class name and description of purpose be included on the
same "printed page" as the copyright notice for easier
identification within third-party archives.
Copyright [yyyy] [name of copyright owner]
Licensed under the Apache License, Version 2.0 (the "License");
you may not use this file except in compliance with the License.
You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License.
# AF2
## Paper
- [https://www.nature.com/articles/s41586-021-03819-2](https://www.nature.com/articles/s41586-021-03819-2)
## Model architecture
The core of the model is a Transformer-based neural network with two main components, a sequence-to-sequence model and a structure model; the two components are optimised through iterative training to improve prediction accuracy.
![img](./docs/alphafold2.png)
## Algorithm
AlphaFold2 extracts information from protein sequence and structure data and uses a neural network model to predict three-dimensional protein structures.
![img](./docs/alphafold2_1.png)
## Environment setup
### Docker (option 1)
```bash
# This method does not require cloning this repository: the image already contains
# runnable code, but the required data files must still be mounted into the container.
docker pull image.sourcefind.cn:5000/dcu/admin/base/custom:alphafold2-dtk24.04.1-py310
docker run --shm-size 100g --network=host --name=alphafold2 --privileged --device=/dev/kfd --device=/dev/dri --group-add video --cap-add=SYS_PTRACE --security-opt seccomp=unconfined -v <local data path>:<container data path> -v /opt/hyhal:/opt/hyhal:ro -it <your IMAGE ID> bash
```
### Docker (option 2)
```bash
docker pull image.sourcefind.cn:5000/dcu/admin/base/jax:0.4.23-ubuntu20.04-dtk24.04.1-py3.10
docker run --shm-size 50g --network=host --name=alphafold2 --privileged --device=/dev/kfd --device=/dev/dri --group-add video --cap-add=SYS_PTRACE --security-opt seccomp=unconfined -v <absolute project path>:/home/ -v /opt/hyhal:/opt/hyhal:ro -it <your IMAGE ID> bash
```
```bash
# 1. General dependencies
pip install -r requirements_dcu.txt
pip install dm-haiku==0.0.11 flax==0.7.1 jmp==0.0.2 tabulate==0.8.9 --no-deps jax
pip install orbax==0.1.6 orbax-checkpoint==0.1.6 optax==0.2.2
python setup.py install
```
```bash
# 2. hh-suite
git clone https://github.com/soedinglab/hh-suite.git
mkdir -p hh-suite/build && cd hh-suite/build
cmake -DCMAKE_INSTALL_PREFIX=. ..
make -j 4 && make install
export PATH="$(pwd)/bin:$(pwd)/scripts:$PATH"

# kalign
wget https://github.com/TimoLassmann/kalign/archive/refs/tags/v3.4.0.zip
unzip v3.4.0.zip && cd kalign-3.4.0
mkdir build
cd build
cmake ..
make
make test
make install
```
```bash
# 3. openmm + pdbfixer
sudo apt install doxygen
wget https://github.com/openmm/openmm/archive/refs/tags/8.0.0.zip
unzip 8.0.0.zip && cd openmm-8.0.0 && mkdir build && cd build
cmake .. && make && sudo make install && sudo make PythonInstall
wget https://github.com/openmm/pdbfixer/archive/refs/tags/1.9.zip
unzip 1.9.zip && cd pdbfixer-1.9 && python setup.py install
```
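To quickly verify the OpenMM and pdbfixer builds, a minimal Python sketch along these lines can help (it assumes the standard `openmm`/`pdbfixer` package names used by the builds above; the platform list depends on how OpenMM was compiled):
```python
import openmm
import pdbfixer  # imported only to confirm the build is importable

# List the compute platforms this OpenMM build exposes (e.g. CPU, OpenCL, HIP).
for i in range(openmm.Platform.getNumPlatforms()):
    print(openmm.Platform.getPlatform(i).getName())
```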
## Dataset
We recommend the open datasets used by AlphaFold2, including BFD, MGnify, PDB70, Uniclust, UniRef90 and others; the full set is about 2.62 TB. The expected directory layout is:
```
$DOWNLOAD_DIR/
bfd/
bfd_metaclust_clu_complete_id30_c90_final_seq.sorted_opt_hhm.ffindex
bfd_metaclust_clu_complete_id30_c90_final_seq.sorted_opt_hhm.ffdata
bfd_metaclust_clu_complete_id30_c90_final_seq.sorted_opt_cs219.ffindex
...
mgnify/
mgy_clusters_2022_05.fa
params/
params_model_1.npz
params_model_2.npz
params_model_3.npz
...
pdb70/
pdb_filter.dat
pdb70_hhm.ffindex
pdb70_hhm.ffdata
...
pdb_mmcif/
mmcif_files/
100d.cif
101d.cif
101m.cif
...
obsolete.dat
pdb_seqres/
pdb_seqres.txt
small_bfd/
bfd-first_non_consensus_sequences.fasta
uniref30/
UniRef30_2021_03_hhm.ffindex
UniRef30_2021_03_hhm.ffdata
UniRef30_2021_03_cs219.ffindex
...
uniprot/
uniprot.fasta
uniref90/
uniref90.fasta
```
The script `download_all_data.sh` is provided to download the datasets and model parameters:
```bash
./scripts/download_all_data.sh <download directory>
```
Fast dataset download centre: [SCNet AIDatasets](http://113.200.138.88:18080/aidatasets). The datasets used by this project are available from the fast channel: [alphafold](http://113.200.138.88:18080/aidatasets/project-dependency/alphafold)
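Before running inference it can be worth sanity-checking the download against the layout shown above. A minimal sketch (`DOWNLOAD_DIR` is a placeholder for your own download directory):
```python
import os

DOWNLOAD_DIR = "/path/to/alphafold_data"  # hypothetical path; adjust to your setup

# Top-level directories expected from the layout shown above.
expected = [
    "bfd", "mgnify", "params", "pdb70", "pdb_mmcif/mmcif_files",
    "pdb_seqres", "small_bfd", "uniref30", "uniprot", "uniref90",
]
for rel in expected:
    path = os.path.join(DOWNLOAD_DIR, rel)
    print(f"{'OK     ' if os.path.isdir(path) else 'MISSING'} {path}")
```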
## Inference
Jax-based inference scripts are provided for both monomer and multimer prediction.
```bash
# Enter the project directory
cd alphafold2_jax
```
### Monomer
```bash
./run_monomer.sh
```
Monomer inference parameters:
- `download_dir`: the dataset download directory; `monomer.fasta`: the monomer sequence to predict.
- `--output_dir`: the output directory.
- `model_names`: the names of the models to run; `--model_preset=monomer` selects the monomer model configuration.
- `--run_relax=true` enables the relaxation step. `--use_gpu_relax=true` runs relaxation on the GPU (faster, but potentially less stable); `--use_gpu_relax=false` runs it on the CPU (slower, but stable).
- Adding `--use_precomputed_msas=true` reuses existing MSAs; otherwise the MSA tools are run by default.
### Multimer
```bash
./run_multimer.sh
```
Multimer inference parameters: `multimer.fasta` is the multimer sequence to predict; `--model_preset=multimer` selects the multimer model configuration; `--num_multimer_predictions_per_model` sets the number of predictions per model. All other parameters are the same as for monomer inference.
## Results
The `--output_dir` directory is structured as follows:
```
<target_name>/
features.pkl
ranked_{0,1,2,3,4}.pdb
ranking_debug.json
relaxed_model_{1,2,3,4,5}.pdb
result_model_{1,2,3,4,5}.pkl
timings.json
unrelaxed_model_{1,2,3,4,5}.pdb
msas/
bfd_uniclust_hits.a3m
mgnify_hits.sto
uniref90_hits.sto
...
```
[View the protein's 3D structure](https://www.pdbus.org/3d-view)
ID: 8U23
Blue: predicted structure; yellow: experimental structure.
![alt text](image.png)
### Accuracy
## Application scenarios
### Algorithm category
Protein structure prediction
### Key application industries
Healthcare, research, education
## Pretrained weights
Fast download centre for pretrained weights: [SCNet AIModels](http://113.200.138.88:18080/aimodels). The weights used by this project are available from the fast channel: [alphafold](http://113.200.138.88:18080/aimodels/findsource-dependency/alphafold-params)
## Source repository and issue reporting
* [https://developer.hpccube.com/codes/modelzoo/alphafold2_jax](https://developer.hpccube.com/codes/modelzoo/alphafold2_jax)
## References
* [https://github.com/deepmind/alphafold](https://github.com/deepmind/alphafold)
# AlphaFold Protein Structure Database
## Introduction
The AlphaFold UniProt release (214M predictions) is hosted on
[Google Cloud Public Datasets](https://console.cloud.google.com/marketplace/product/bigquery-public-data/deepmind-alphafold),
and is available to download at no cost under a
[CC-BY-4.0 licence](http://creativecommons.org/licenses/by/4.0/legalcode). The
dataset is in a Cloud Storage bucket, and metadata is available on BigQuery. A
Google Cloud account is required for the download, but the data can be freely
used under the terms of the
[CC-BY 4.0 Licence](http://creativecommons.org/licenses/by/4.0/legalcode).
This document provides an overview of how to access and download the dataset for
different use cases. Please refer to the [AlphaFold database FAQ](https://www.alphafold.com/faq)
for further information on what proteins are in the database and a changelog of
releases.
:ledger: **Note: The full dataset is difficult to manipulate without significant
computational resources (the size of the dataset is 23 TiB, 3 * 214M files).**
There are also alternatives to downloading the full dataset:
1. Download a premade subset (covering important species / Swiss-Prot) via our
[download page](https://alphafold.ebi.ac.uk/download).
2. Download a custom subset of the data. See below.
If you need to download the full dataset then please see the "Bulk download"
section. See "Creating a Google Cloud Account" below for more information on how
to avoid any surprise costs when using Google Cloud Public Datasets.
## Licence
Data is available for academic and commercial use, under a
[CC-BY-4.0 licence](http://creativecommons.org/licenses/by/4.0/legalcode).
EMBL-EBI expects attribution (e.g. in publications, services or products) for
any of its online services, databases or software in accordance with good
scientific practice.
If you make use of an AlphaFold prediction, please cite the following papers:
* [Jumper, J et al. Highly accurate protein structure prediction with
AlphaFold. Nature
(2021).](https://www.nature.com/articles/s41586-021-03819-2)
* [Varadi, M et al. AlphaFold Protein Structure Database: massively expanding
the structural coverage of protein-sequence space with high-accuracy models.
Nucleic Acids Research
(2021).](https://academic.oup.com/nar/advance-article/doi/10.1093/nar/gkab1061/6430488)
AlphaFold Data Copyright (2022) DeepMind Technologies Limited.
## Disclaimer
The AlphaFold Data and other information provided on this site is for
theoretical modelling only; caution should be exercised in its use. It is
provided 'as-is' without any warranty of any kind, whether expressed or implied.
For clarity, no warranty is given that use of the information shall not infringe
the rights of any third party. The information is not intended to be a
substitute for professional medical advice, diagnosis, or treatment, and does
not constitute medical or other professional advice.
## Format
Dataset file names start with a protein identifier of the form `AF-[a UniProt
accession]-F[a fragment number]`.
Three files are provided for each entry:
* **model_v4.cif** – contains the atomic coordinates for the predicted protein
structure, along with some metadata. Useful references for this file format
are the [ModelCIF](https://github.com/ihmwg/ModelCIF) and
[PDBx/mmCIF](https://mmcif.wwpdb.org) project sites.
* **confidence_v4.json** – contains a confidence metric output by AlphaFold
called pLDDT. This provides a number for each residue, indicating how
confident AlphaFold is in the *local* surrounding structure. pLDDT ranges
from 0 to 100, where 100 is most confident. This is also contained in the
CIF file.
* **predicted_aligned_error_v4.json** – contains a confidence metric output by
AlphaFold called PAE. This provides a number for every pair of residues,
which is lower when AlphaFold is more confident in the relative position of
the two residues. PAE is more suitable than pLDDT for judging confidence in
relative domain placements.
[See here](https://alphafold.ebi.ac.uk/faq#faq-7) for a description of the
format.
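As a concrete illustration of the two JSON formats, here is a minimal Python sketch for reading one entry's files. The key names follow the `confidence_json` and `pae_json` helpers included later in this repository; the file names are examples built from the naming pattern above:
```python
import json

import numpy as np

# Example AFDB entry; the identifier format is AF-[accession]-F[fragment].
entry = "AF-A8H2R3-F1"

# Per-residue pLDDT (0-100, higher is more confident).
with open(f"{entry}-confidence_v4.json") as f:
    conf = json.load(f)
plddt = np.array(conf["confidenceScore"])
print("mean pLDDT:", plddt.mean())

# Pairwise PAE: a list with one dict holding an n_res x n_res matrix.
with open(f"{entry}-predicted_aligned_error_v4.json") as f:
    pae_data = json.load(f)[0]
pae = np.array(pae_data["predicted_aligned_error"])
print("PAE shape:", pae.shape, "max possible:", pae_data["max_predicted_aligned_error"])
```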
Predictions grouped by NCBI taxonomy ID are available as
`proteomes/proteome-tax_id-[TAX ID]-[SHARD ID]_v4.tar` within the same
bucket.
There are also two extra files stored in the bucket:
* `accession_ids.csv` – This file contains a list of all the UniProt
accessions that have predictions in AlphaFold DB (see the parsing sketch
after this list). The file is in CSV format and includes the following
columns, separated by a comma:
* UniProt accession, e.g. A8H2R3
* First residue index (UniProt numbering), e.g. 1
* Last residue index (UniProt numbering), e.g. 199
* AlphaFold DB identifier, e.g. AF-A8H2R3-F1
* Latest version, e.g. 4
* `sequences.fasta` – This file contains sequences for all proteins in the
current database version in FASTA format. The identifier rows start with
">AFDB", followed by the AlphaFold DB identifier and the name of the
protein. The sequence rows contain the corresponding amino acid sequences.
Each sequence is on a single line, i.e. there is no wrapping.
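For example, a minimal sketch that scans `accession_ids.csv` for one accession, using the column order documented above (the example values come from that list):
```python
import csv

def find_entry(csv_path: str, accession: str):
    """Returns (accession, first_res, last_res, afdb_id, version) or None."""
    with open(csv_path, newline="") as f:
        for row in csv.reader(f):
            if row[0] == accession:
                return row[0], int(row[1]), int(row[2]), row[3], int(row[4])
    return None

print(find_entry("accession_ids.csv", "A8H2R3"))  # ('A8H2R3', 1, 199, 'AF-A8H2R3-F1', 4)
```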
## Creating a Google Cloud Account
Downloading from the Google Cloud Public Datasets (rather than from AFDB or 3D
Beacons) requires a Google Cloud account. See the
[Google Cloud get started](https://cloud.google.com/docs/get-started) page, and
explore the [free tier account usage limits](https://cloud.google.com/free).
**IMPORTANT: After the trial period has finished (90 days), to continue access,
you are required to upgrade to a billing account. While your free tier access
(including access to the Public Datasets storage bucket) continues, usage beyond
the free tier will incur costs – please familiarise yourself with the pricing
for the services that you use to avoid any surprises.**
1. Go to
[https://cloud.google.com/datasets](https://cloud.google.com/datasets).
2. Create an account:
1. Click "get started for free" in the top right corner.
2. Agree to all terms of service.
3. Follow the setup instructions. Note that a payment method is required,
but this will not be used unless you enable billing.
4. Access to the Google Cloud Public Datasets storage bucket is always at
no cost and you will have access to the
[free tier.](https://cloud.google.com/free/docs/gcp-free-tier#free-tier-usage-limits)
3. Set up a project:
1. In the top left corner, click the navigation menu (three horizontal bars
icon).
2. Select: "Cloud overview" -> "Dashboard".
3. In the top left corner there is a project menu bar (likely says "My
First Project"). Select this and a "Select a Project" box will appear.
4. To keep using this project, click "Cancel" at the bottom of the box.
5. To create a new project, click "New Project" at the top of the box:
1. Select a project name.
2. For location, if your organization has a Cloud account then select
this, otherwise leave as is.
4. Install `gsutil`:
1. Follow these
[instructions](https://cloud.google.com/storage/docs/gsutil_install).
## Accessing the dataset
The data is available from:
* GCS data bucket:
[gs://public-datasets-deepmind-alphafold-v4](https://console.cloud.google.com/storage/browser/public-datasets-deepmind-alphafold-v4)
## Bulk download
We don't recommend downloading the full dataset unless required for processing
with local computational resources, for example in an academic high performance
computing centre.
We estimate that a 1 Gbps internet connection will allow download of the full
database in roughly 2.5 days.
While we don’t know the exact nature of your computational infrastructure, below
are some suggested approaches for downloading the dataset. Please reach out to
[alphafold@deepmind.com](mailto:alphafold@deepmind.com) if you have any
questions.
The recommended way of downloading the whole database is by downloading
1,015,797 sharded proteome tar files using the command below. This is
significantly faster than downloading all of the individual files because of
large constant per-file latency.
```bash
gsutil -m cp -r gs://public-datasets-deepmind-alphafold-v4/proteomes/ .
```
You will then have to un-tar all of the proteomes and un-gzip all of the
individual files. Note that after un-tarring, there will be about 644M files,
so make sure your filesystem can handle this.
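A minimal Python sketch of that un-tar / un-gzip step, assuming the shards sit in a local `proteomes/` directory and their members extract flat (a real run over ~1M shards should be parallelised):
```python
import glob
import gzip
import os
import shutil
import tarfile

# Extract every downloaded proteome shard into ./extracted.
for tar_path in glob.glob("proteomes/*.tar"):
    with tarfile.open(tar_path) as tar:
        tar.extractall("extracted")

# Decompress the per-protein .gz members and drop the compressed copies.
for gz_path in glob.glob("extracted/*.gz"):
    with gzip.open(gz_path, "rb") as src, open(gz_path[:-3], "wb") as dst:
        shutil.copyfileobj(src, dst)
    os.remove(gz_path)
```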
### Storage Transfer Service
Some users might find the
[Storage Transfer Service](https://cloud.google.com/storage-transfer-service) a
convenient way to set up the transfer between this bucket and another bucket, or
another cloud service. *Using this service may incur costs*. Please check the
[pricing page](https://cloud.google.com/storage-transfer/pricing) for more
detail, particularly for transfers to other cloud services.
## Downloading subsets of the data
### AlphaFold Database search
For simple queries, for example by protein name, gene name or UniProt accession,
you can use the main search bar on
[alphafold.ebi.ac.uk](https://alphafold.ebi.ac.uk).
### 3D Beacons
[3D-Beacons](https://3d-beacons.org) is an international collaboration of
protein structure data providers to create a federated network with unified data
access mechanisms. The 3D-Beacons platform allows users to retrieve coordinate
files and metadata of experimentally determined and theoretical protein models
from data providers such as AlphaFold DB.
More information about how to access AlphaFold predictions using 3D-Beacons is
available at
[3D-Beacons documentation](https://www.ebi.ac.uk/pdbe/pdbe-kb/3dbeacons/docs).
### Other premade species subsets
Downloads for some model organism proteomes, global health proteomes and
Swiss-Prot are available on the
[AFDB website](https://alphafold.ebi.ac.uk/download). These are generated from
[reference proteomes](https://www.uniprot.org/help/reference_proteome). If you
want other species, or *all* proteins for a particular species, please continue
reading.
We provide 1,015,797 sharded tar files for all species in
[gs://public-datasets-deepmind-alphafold-v4/proteomes/](https://console.cloud.google.com/storage/browser/public-datasets-deepmind-alphafold-v4/proteomes/).
We shard each proteome so that each shard contains at most 10,000 proteins
(which corresponds to 30,000 files per shard, since there are 3 files per
protein). To download a proteome of your choice, you have to do the following
steps:
1. Find the [NCBI taxonomy ID](https://www.ncbi.nlm.nih.gov/taxonomy)
(`[TAX_ID]`) of the species in question.
2. Run `gsutil -m cp
gs://public-datasets-deepmind-alphafold-v4/proteomes/proteome-tax_id-[TAX
ID]-*_v4.tar .` to download all shards for this proteome.
3. Un-tar all of the downloaded files and un-gzip all of the individual files.
### File manifests
Pre-made lists of files (manifests) are available at
[gs://public-datasets-deepmind-alphafold-v4/manifests](https://console.cloud.google.com/storage/browser/public-datasets-deepmind-alphafold-v4/manifests/).
Note that these filenames do not include the bucket prefix, but this can be
added once the files have been downloaded to your filesystem.
You can also define your own list of files, for example one created by BigQuery
(see below). `gsutil` can be used to download these files with
```bash
cat [manifest file] | gsutil -m cp -I .
```
This will be much slower than downloading the tar files (grouped by species)
because each file has an associated overhead.
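If your manifest still lacks the bucket prefix (see the note under "File manifests" above), a one-pass sketch to prepend it might look like:
```python
BUCKET = "gs://public-datasets-deepmind-alphafold-v4/"

# Prepend the bucket prefix to every manifest entry so `gsutil -m cp -I` can consume it.
with open("manifest.txt") as src, open("manifest_full.txt", "w") as dst:
    for line in src:
        name = line.strip()
        if name:
            dst.write(BUCKET + name + "\n")
```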
### BigQuery
**IMPORTANT: The
[free tier](https://cloud.google.com/bigquery/pricing#free-tier) of Google Cloud
comes with [BigQuery Sandbox](https://cloud.google.com/bigquery/docs/sandbox)
with 1 TB of free processed query data each month. Repeated queries within a
month could exceed this limit and if you have
[upgraded to a paid Cloud Billing account](https://cloud.google.com/free/docs/gcp-free-tier#how-to-upgrade)
you may be charged.**
**This should be sufficient for running a number of queries on the metadata
table, though the usage depends on the size of the columns queried and selected.
Please look at the
[BigQuery pricing page](https://cloud.google.com/bigquery/pricing) for more
information.**
**This is the user's responsibility so please ensure you keep track of your
billing settings and resource usage in the console.**
BigQuery provides a serverless and highly scalable analytics tool enabling SQL
queries over large datasets. The metadata for the UniProt dataset takes up 113
GiB and so can be challenging to process and analyse locally. The table name is:
* BigQuery metadata table:
[bigquery-public-data.deepmind_alphafold.metadata](https://console.cloud.google.com/bigquery?project=bigquery-public-data&ws=!1m5!1m4!4m3!1sbigquery-public-data!2sdeepmind_alphafold!3smetadata)
With BigQuery SQL you can do complex queries, e.g. find all high accuracy
predictions for a particular species, or even join on to other datasets, e.g. to
an experimental dataset by the `uniprotSequence`, or to the NCBI taxonomy by
`taxId`.
If you would find additional information in the metadata useful please file a
GitHub issue.
#### Setup
Follow the
[BigQuery Sandbox set up guide](https://cloud.google.com/bigquery/docs/sandbox).
#### Exploring the metadata
The column names and associated data types available can be found using the
following query.
```sql
SELECT column_name, data_type FROM bigquery-public-data.deepmind_alphafold.INFORMATION_SCHEMA.COLUMNS
WHERE table_name = 'metadata'
```
**Column name** | **Data type** | **Description**
---------------------- | --------------- | ---------------
allVersions | `ARRAY<INT64>` | An array of AFDB versions this prediction has had
entryId | `STRING` | The AFDB entry ID, e.g. "AF-Q1HGU3-F1"
fractionPlddtConfident | `FLOAT64` | Fraction of the residues in the prediction with pLDDT between 70 and 90
fractionPlddtLow | `FLOAT64` | Fraction of the residues in the prediction with pLDDT between 50 and 70
fractionPlddtVeryHigh | `FLOAT64` | Fraction of the residues in the prediction with pLDDT greater than 90
fractionPlddtVeryLow | `FLOAT64` | Fraction of the residues in the prediction with pLDDT less than 50
gene | `STRING` | The name of the gene if known, e.g. "COII"
geneSynonyms | `ARRAY<STRING>` | Additional synonyms for the gene
globalMetricValue | `FLOAT64` | The mean pLDDT of this prediction
isReferenceProteome | `BOOL` | Is this protein part of the reference proteome?
isReviewed | `BOOL` | Has this protein been reviewed, i.e. is it part of SwissProt?
latestVersion | `INT64` | The latest AFDB version for this prediction
modelCreatedDate | `DATE` | The date of creation for this entry, e.g. "2022-06-01"
organismCommonNames | `ARRAY<STRING>` | List of common organism names
organismScientificName | `STRING` | The scientific name of the organism
organismSynonyms | `ARRAY<STRING>` | List of synonyms for the organism
proteinFullNames | `ARRAY<STRING>` | Full names of the protein
proteinShortNames | `ARRAY<STRING>` | Short names of the protein
sequenceChecksum | `STRING` | [CRC64 hash](https://www.uniprot.org/help/checksum) of the sequence. Can be used for cheaper lookups.
sequenceVersionDate | `DATE` | Date when the sequence data was last modified in UniProt
taxId | `INT64` | NCBI taxonomy id of the originating species
uniprotAccession | `STRING` | Uniprot accession ID
uniprotDescription | `STRING` | The name recommended by the UniProt consortium
uniprotEnd | `INT64` | Number of the last residue in the entry relative to the UniProt entry. This is equal to the length of the protein unless we are dealing with protein fragments.
uniprotId | `STRING` | The Uniprot EntryName field
uniprotSequence | `STRING` | Amino acid sequence for this prediction
uniprotStart | `INT64` | Number of the first residue in the entry relative to the UniProt entry. This is 1 unless we are dealing with protein fragments.
#### Producing summary statistics
The following query gives the mean of the prediction confidence fractions per
species.
```sql
SELECT
organismScientificName AS name,
SUM(fractionPlddtVeryLow) / COUNT(fractionPlddtVeryLow) AS mean_plddt_very_low,
SUM(fractionPlddtLow) / COUNT(fractionPlddtLow) AS mean_plddt_low,
SUM(fractionPlddtConfident) / COUNT(fractionPlddtConfident) AS mean_plddt_confident,
SUM(fractionPlddtVeryHigh) / COUNT(fractionPlddtVeryHigh) AS mean_plddt_very_high,
COUNT(organismScientificName) AS num_predictions
FROM bigquery-public-data.deepmind_alphafold.metadata
GROUP by name
ORDER BY num_predictions DESC;
```
#### Producing lists of files
We expect that the most important use for the metadata will be to create subsets
of proteins according to various criteria, so that users can choose to only copy
a subset of the 214M proteins that exist in the dataset. An example query is
given below:
```sql
with file_rows AS (
with file_cols AS (
SELECT
CONCAT(entryID, '-model_v4.cif') as m,
CONCAT(entryID, '-predicted_aligned_error_v4.json') as p
FROM bigquery-public-data.deepmind_alphafold.metadata
WHERE organismScientificName = "Homo sapiens"
AND (fractionPlddtVeryHigh + fractionPlddtConfident) > 0.5
)
SELECT * FROM file_cols UNPIVOT (files for filetype in (m, p))
)
SELECT CONCAT('gs://public-datasets-deepmind-alphafold-v4/', files) as files
from file_rows
```
In this case, the list has been filtered to only include proteins from *Homo
sapiens* for which over half the residues are confident or better (>70 pLDDT).
This creates a table with one column "files", where each row is the cloud
location of one of the two file types that has been provided for each protein.
There is an additional `confidence_v4.json` file which contains the
per-residue pLDDT. This information is already in the CIF file but may be
preferred if only this information is required.
This allows users to bulk download the exact proteins they need, without having
to download the entire dataset. Other columns may also be used to select subsets
of proteins, and we point the user to the
[BigQuery documentation](https://cloud.google.com/bigquery/docs) to understand
other ways to filter for their desired protein lists. Likewise, the
documentation should be followed to download these file subsets locally, as the
most appropriate approach will depend on the filesize. Note that it may be
easier to download large files using [Colab](https://colab.research.google.com/)
(e.g. pandas to_csv).
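If you prefer to run such a query from Python (for example in Colab, as suggested above), a hedged sketch using the official `google-cloud-bigquery` client, which you would need to install and authenticate first:
```python
from google.cloud import bigquery

client = bigquery.Client()  # uses your default Google Cloud project/credentials

query = """
SELECT CONCAT('gs://public-datasets-deepmind-alphafold-v4/', entryId, '-model_v4.cif') AS files
FROM `bigquery-public-data.deepmind_alphafold.metadata`
WHERE organismScientificName = 'Homo sapiens'
  AND (fractionPlddtVeryHigh + fractionPlddtConfident) > 0.5
"""

# to_dataframe() needs pandas (and the db-dtypes package) installed.
df = client.query(query).to_dataframe()
df["files"].to_csv("manifest.txt", index=False, header=False)
```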
#### Previous versions
Previous versions of AFDB will remain available at
[gs://public-datasets-deepmind-alphafold](https://console.cloud.google.com/storage/browser/public-datasets-deepmind-alphafold)
to enable reproducible research. We recommend using the latest version (v4).
# Copyright 2021 DeepMind Technologies Limited
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
"""An implementation of the inference pipeline of AlphaFold v2.0."""
# Copyright 2021 DeepMind Technologies Limited
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
"""Common data types and constants used within Alphafold."""
# Copyright 2021 DeepMind Technologies Limited
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
"""Functions for processing confidence metrics."""
import json
from typing import Dict, Optional, Tuple
import numpy as np
import scipy.special
def compute_plddt(logits: np.ndarray) -> np.ndarray:
"""Computes per-residue pLDDT from logits.
Args:
logits: [num_res, num_bins] output from the PredictedLDDTHead.
Returns:
plddt: [num_res] per-residue pLDDT.
"""
num_bins = logits.shape[-1]
bin_width = 1.0 / num_bins
bin_centers = np.arange(start=0.5 * bin_width, stop=1.0, step=bin_width)
probs = scipy.special.softmax(logits, axis=-1)
predicted_lddt_ca = np.sum(probs * bin_centers[None, :], axis=-1)
return predicted_lddt_ca * 100
def _confidence_category(score: float) -> str:
"""Categorizes pLDDT into: disordered (D), low (L), medium (M), high (H)."""
if 0 <= score < 50:
return 'D'
if 50 <= score < 70:
return 'L'
elif 70 <= score < 90:
return 'M'
elif 90 <= score <= 100:
return 'H'
else:
raise ValueError(f'Invalid pLDDT score {score}')
def confidence_json(plddt: np.ndarray) -> str:
"""Returns JSON with confidence score and category for every residue.
Args:
plddt: Per-residue confidence metric data.
Returns:
String with a formatted JSON.
Raises:
ValueError: If `plddt` has a rank different than 1.
"""
if plddt.ndim != 1:
raise ValueError(f'The plddt array must be rank 1, got: {plddt.shape}.')
confidence = {
'residueNumber': list(range(1, len(plddt) + 1)),
'confidenceScore': [round(float(s), 2) for s in plddt],
'confidenceCategory': [_confidence_category(s) for s in plddt],
}
return json.dumps(confidence, indent=None, separators=(',', ':'))
def _calculate_bin_centers(breaks: np.ndarray):
"""Gets the bin centers from the bin edges.
Args:
breaks: [num_bins - 1] the error bin edges.
Returns:
bin_centers: [num_bins] the error bin centers.
"""
step = (breaks[1] - breaks[0])
# Add half-step to get the center
bin_centers = breaks + step / 2
# Add a catch-all bin at the end.
bin_centers = np.concatenate([bin_centers, [bin_centers[-1] + step]],
axis=0)
return bin_centers
def _calculate_expected_aligned_error(
alignment_confidence_breaks: np.ndarray,
aligned_distance_error_probs: np.ndarray) -> Tuple[np.ndarray, np.ndarray]:
"""Calculates expected aligned distance errors for every pair of residues.
Args:
alignment_confidence_breaks: [num_bins - 1] the error bin edges.
aligned_distance_error_probs: [num_res, num_res, num_bins] the predicted
probs for each error bin, for each pair of residues.
Returns:
predicted_aligned_error: [num_res, num_res] the expected aligned distance
error for each pair of residues.
max_predicted_aligned_error: The maximum predicted error possible.
"""
bin_centers = _calculate_bin_centers(alignment_confidence_breaks)
# Tuple of expected aligned distance error and max possible error.
return (np.sum(aligned_distance_error_probs * bin_centers, axis=-1),
np.asarray(bin_centers[-1]))
def compute_predicted_aligned_error(
logits: np.ndarray,
breaks: np.ndarray) -> Dict[str, np.ndarray]:
"""Computes aligned confidence metrics from logits.
Args:
logits: [num_res, num_res, num_bins] the logits output from
PredictedAlignedErrorHead.
breaks: [num_bins - 1] the error bin edges.
Returns:
aligned_confidence_probs: [num_res, num_res, num_bins] the predicted
aligned error probabilities over bins for each residue pair.
predicted_aligned_error: [num_res, num_res] the expected aligned distance
error for each pair of residues.
max_predicted_aligned_error: The maximum predicted error possible.
"""
aligned_confidence_probs = scipy.special.softmax(
logits,
axis=-1)
predicted_aligned_error, max_predicted_aligned_error = (
_calculate_expected_aligned_error(
alignment_confidence_breaks=breaks,
aligned_distance_error_probs=aligned_confidence_probs))
return {
'aligned_confidence_probs': aligned_confidence_probs,
'predicted_aligned_error': predicted_aligned_error,
'max_predicted_aligned_error': max_predicted_aligned_error,
}
def pae_json(pae: np.ndarray, max_pae: float) -> str:
"""Returns the PAE in the same format as is used in the AFDB.
Note that the values are presented as floats to 1 decimal place, whereas AFDB
returns integer values.
Args:
pae: The n_res x n_res PAE array.
max_pae: The maximum possible PAE value.
Returns:
PAE output format as a JSON string.
"""
# Check the PAE array is the correct shape.
if pae.ndim != 2 or pae.shape[0] != pae.shape[1]:
raise ValueError(f'PAE must be a square matrix, got {pae.shape}')
# Round the predicted aligned errors to 1 decimal place.
rounded_errors = np.round(pae.astype(np.float64), decimals=1)
formatted_output = [{
'predicted_aligned_error': rounded_errors.tolist(),
'max_predicted_aligned_error': max_pae,
}]
return json.dumps(formatted_output, indent=None, separators=(',', ':'))
def predicted_tm_score(
logits: np.ndarray,
breaks: np.ndarray,
residue_weights: Optional[np.ndarray] = None,
asym_id: Optional[np.ndarray] = None,
interface: bool = False) -> np.ndarray:
"""Computes predicted TM alignment or predicted interface TM alignment score.
Args:
logits: [num_res, num_res, num_bins] the logits output from
PredictedAlignedErrorHead.
breaks: [num_bins] the error bins.
residue_weights: [num_res] the per residue weights to use for the
expectation.
asym_id: [num_res] the asymmetric unit ID - the chain ID. Only needed for
ipTM calculation, i.e. when interface=True.
interface: If True, interface predicted TM score is computed.
Returns:
ptm_score: The predicted TM alignment or the predicted iTM score.
"""
# residue_weights has to be in [0, 1], but can be floating-point, i.e. the
# exp. resolved head's probability.
if residue_weights is None:
residue_weights = np.ones(logits.shape[0])
bin_centers = _calculate_bin_centers(breaks)
num_res = int(np.sum(residue_weights))
# Clip num_res to avoid negative/undefined d0.
clipped_num_res = max(num_res, 19)
# Compute d_0(num_res) as defined by TM-score, eqn. (5) in Yang & Skolnick
# "Scoring function for automated assessment of protein structure template
# quality", 2004: http://zhanglab.ccmb.med.umich.edu/papers/2004_3.pdf
d0 = 1.24 * (clipped_num_res - 15) ** (1./3) - 1.8
# Convert logits to probs.
probs = scipy.special.softmax(logits, axis=-1)
# TM-Score term for every bin.
tm_per_bin = 1. / (1 + np.square(bin_centers) / np.square(d0))
# E_distances tm(distance).
predicted_tm_term = np.sum(probs * tm_per_bin, axis=-1)
pair_mask = np.ones(shape=(num_res, num_res), dtype=bool)
if interface:
pair_mask *= asym_id[:, None] != asym_id[None, :]
predicted_tm_term *= pair_mask
pair_residue_weights = pair_mask * (
residue_weights[None, :] * residue_weights[:, None])
normed_residue_mask = pair_residue_weights / (1e-8 + np.sum(
pair_residue_weights, axis=-1, keepdims=True))
per_alignment = np.sum(predicted_tm_term * normed_residue_mask, axis=-1)
return np.asarray(per_alignment[(per_alignment * residue_weights).argmax()])
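To make the helpers above concrete, a small usage sketch with synthetic logits (shapes follow the docstrings; the numbers are random, not real model output):
```python
import numpy as np

from alphafold.common import confidence

num_res, num_bins = 10, 50
rng = np.random.default_rng(0)

# Per-residue pLDDT from PredictedLDDTHead-style logits, then the AFDB-style JSON.
plddt = confidence.compute_plddt(rng.normal(size=(num_res, num_bins)))
print(confidence.confidence_json(plddt))

# Pairwise PAE from PredictedAlignedErrorHead-style logits and [num_bins - 1] bin edges.
pae_logits = rng.normal(size=(num_res, num_res, num_bins))
breaks = np.linspace(0.0, 31.0, num_bins - 1)
pae = confidence.compute_predicted_aligned_error(logits=pae_logits, breaks=breaks)
print(confidence.pae_json(pae['predicted_aligned_error'],
                          float(pae['max_predicted_aligned_error'])))
```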
# Copyright 2023 DeepMind Technologies Limited
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
"""Test confidence metrics."""
from absl.testing import absltest
from alphafold.common import confidence
import numpy as np
class ConfidenceTest(absltest.TestCase):
def test_pae_json(self):
pae = np.array([[0.01, 13.12345], [20.0987, 0.0]])
pae_json = confidence.pae_json(pae=pae, max_pae=31.75)
self.assertEqual(
pae_json, '[{"predicted_aligned_error":[[0.0,13.1],[20.1,0.0]],'
'"max_predicted_aligned_error":31.75}]')
def test_confidence_json(self):
plddt = np.array([42, 42.42])
confidence_json = confidence.confidence_json(plddt=plddt)
print(confidence_json)
self.assertEqual(
confidence_json,
('{"residueNumber":[1,2],'
'"confidenceScore":[42.0,42.42],'
'"confidenceCategory":["D","D"]}'),
)
if __name__ == '__main__':
absltest.main()
# Copyright 2021 DeepMind Technologies Limited
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
"""mmCIF metadata."""
from typing import Mapping, Sequence
from alphafold import version
import numpy as np
_DISCLAIMER = """ALPHAFOLD DATA, COPYRIGHT (2021) DEEPMIND TECHNOLOGIES LIMITED.
THE INFORMATION PROVIDED IS THEORETICAL MODELLING ONLY AND CAUTION SHOULD BE
EXERCISED IN ITS USE. IT IS PROVIDED "AS-IS" WITHOUT ANY WARRANTY OF ANY KIND,
WHETHER EXPRESSED OR IMPLIED. NO WARRANTY IS GIVEN THAT USE OF THE INFORMATION
SHALL NOT INFRINGE THE RIGHTS OF ANY THIRD PARTY. DISCLAIMER: THE INFORMATION IS
NOT INTENDED TO BE A SUBSTITUTE FOR PROFESSIONAL MEDICAL ADVICE, DIAGNOSIS, OR
TREATMENT, AND DOES NOT CONSTITUTE MEDICAL OR OTHER PROFESSIONAL ADVICE. IT IS
AVAILABLE FOR ACADEMIC AND COMMERCIAL PURPOSES, UNDER CC-BY 4.0 LICENCE."""
# Authors of the Nature methods paper we reference in the mmCIF.
_MMCIF_PAPER_AUTHORS = (
'Jumper, John',
'Evans, Richard',
'Pritzel, Alexander',
'Green, Tim',
'Figurnov, Michael',
'Ronneberger, Olaf',
'Tunyasuvunakool, Kathryn',
'Bates, Russ',
'Zidek, Augustin',
'Potapenko, Anna',
'Bridgland, Alex',
'Meyer, Clemens',
'Kohl, Simon A. A.',
'Ballard, Andrew J.',
'Cowie, Andrew',
'Romera-Paredes, Bernardino',
'Nikolov, Stanislav',
'Jain, Rishub',
'Adler, Jonas',
'Back, Trevor',
'Petersen, Stig',
'Reiman, David',
'Clancy, Ellen',
'Zielinski, Michal',
'Steinegger, Martin',
'Pacholska, Michalina',
'Berghammer, Tamas',
'Silver, David',
'Vinyals, Oriol',
'Senior, Andrew W.',
'Kavukcuoglu, Koray',
'Kohli, Pushmeet',
'Hassabis, Demis',
)
# Authors of the mmCIF - we set them to be equal to the authors of the paper.
_MMCIF_AUTHORS = _MMCIF_PAPER_AUTHORS
def add_metadata_to_mmcif(
old_cif: Mapping[str, Sequence[str]], model_type: str
) -> Mapping[str, Sequence[str]]:
"""Adds AlphaFold metadata in the given mmCIF."""
cif = {}
# ModelCIF conformation dictionary.
cif['_audit_conform.dict_name'] = ['mmcif_ma.dic']
cif['_audit_conform.dict_version'] = ['1.3.9']
cif['_audit_conform.dict_location'] = [
'https://raw.githubusercontent.com/ihmwg/ModelCIF/master/dist/'
'mmcif_ma.dic'
]
# License and disclaimer.
cif['_pdbx_data_usage.id'] = ['1', '2']
cif['_pdbx_data_usage.type'] = ['license', 'disclaimer']
cif['_pdbx_data_usage.details'] = [
'Data in this file is available under a CC-BY-4.0 license.',
_DISCLAIMER,
]
cif['_pdbx_data_usage.url'] = [
'https://creativecommons.org/licenses/by/4.0/',
'?',
]
cif['_pdbx_data_usage.name'] = ['CC-BY-4.0', '?']
# Structure author details.
cif['_audit_author.name'] = []
cif['_audit_author.pdbx_ordinal'] = []
for author_index, author_name in enumerate(_MMCIF_AUTHORS, start=1):
cif['_audit_author.name'].append(author_name)
cif['_audit_author.pdbx_ordinal'].append(str(author_index))
# Paper author details.
cif['_citation_author.citation_id'] = []
cif['_citation_author.name'] = []
cif['_citation_author.ordinal'] = []
for author_index, author_name in enumerate(_MMCIF_PAPER_AUTHORS, start=1):
cif['_citation_author.citation_id'].append('primary')
cif['_citation_author.name'].append(author_name)
cif['_citation_author.ordinal'].append(str(author_index))
# Paper citation details.
cif['_citation.id'] = ['primary']
cif['_citation.title'] = [
'Highly accurate protein structure prediction with AlphaFold'
]
cif['_citation.journal_full'] = ['Nature']
cif['_citation.journal_volume'] = ['596']
cif['_citation.page_first'] = ['583']
cif['_citation.page_last'] = ['589']
cif['_citation.year'] = ['2021']
cif['_citation.journal_id_ASTM'] = ['NATUAS']
cif['_citation.country'] = ['UK']
cif['_citation.journal_id_ISSN'] = ['0028-0836']
cif['_citation.journal_id_CSD'] = ['0006']
cif['_citation.book_publisher'] = ['?']
cif['_citation.pdbx_database_id_PubMed'] = ['34265844']
cif['_citation.pdbx_database_id_DOI'] = ['10.1038/s41586-021-03819-2']
# Type of data in the dataset including data used in the model generation.
cif['_ma_data.id'] = ['1']
cif['_ma_data.name'] = ['Model']
cif['_ma_data.content_type'] = ['model coordinates']
# Description of number of instances for each entity.
cif['_ma_target_entity_instance.asym_id'] = old_cif['_struct_asym.id']
cif['_ma_target_entity_instance.entity_id'] = old_cif[
'_struct_asym.entity_id'
]
cif['_ma_target_entity_instance.details'] = ['.'] * len(
cif['_ma_target_entity_instance.entity_id']
)
# Details about the target entities.
cif['_ma_target_entity.entity_id'] = cif[
'_ma_target_entity_instance.entity_id'
]
cif['_ma_target_entity.data_id'] = ['1'] * len(
cif['_ma_target_entity.entity_id']
)
cif['_ma_target_entity.origin'] = ['.'] * len(
cif['_ma_target_entity.entity_id']
)
# Details of the models being deposited.
cif['_ma_model_list.ordinal_id'] = ['1']
cif['_ma_model_list.model_id'] = ['1']
cif['_ma_model_list.model_group_id'] = ['1']
cif['_ma_model_list.model_name'] = ['Top ranked model']
cif['_ma_model_list.model_group_name'] = [
f'AlphaFold {model_type} v{version.__version__} model'
]
cif['_ma_model_list.data_id'] = ['1']
cif['_ma_model_list.model_type'] = ['Ab initio model']
# Software used.
cif['_software.pdbx_ordinal'] = ['1']
cif['_software.name'] = ['AlphaFold']
cif['_software.version'] = [f'v{version.__version__}']
cif['_software.type'] = ['package']
cif['_software.description'] = ['Structure prediction']
cif['_software.classification'] = ['other']
cif['_software.date'] = ['?']
# Collection of software into groups.
cif['_ma_software_group.ordinal_id'] = ['1']
cif['_ma_software_group.group_id'] = ['1']
cif['_ma_software_group.software_id'] = ['1']
# Method description to conform with ModelCIF.
cif['_ma_protocol_step.ordinal_id'] = ['1', '2', '3']
cif['_ma_protocol_step.protocol_id'] = ['1', '1', '1']
cif['_ma_protocol_step.step_id'] = ['1', '2', '3']
cif['_ma_protocol_step.method_type'] = [
'coevolution MSA',
'template search',
'modeling',
]
# Details of the metrics use to assess model confidence.
cif['_ma_qa_metric.id'] = ['1', '2']
cif['_ma_qa_metric.name'] = ['pLDDT', 'pLDDT']
# Accepted values are distance, energy, normalised score, other, zscore.
cif['_ma_qa_metric.type'] = ['pLDDT', 'pLDDT']
cif['_ma_qa_metric.mode'] = ['global', 'local']
cif['_ma_qa_metric.software_group_id'] = ['1', '1']
# Global model confidence metric value.
cif['_ma_qa_metric_global.ordinal_id'] = ['1']
cif['_ma_qa_metric_global.model_id'] = ['1']
cif['_ma_qa_metric_global.metric_id'] = ['1']
global_plddt = np.mean(
[float(v) for v in old_cif['_atom_site.B_iso_or_equiv']]
)
cif['_ma_qa_metric_global.metric_value'] = [f'{global_plddt:.2f}']
cif['_atom_type.symbol'] = sorted(set(old_cif['_atom_site.type_symbol']))
return cif
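A minimal, hedged sketch of calling `add_metadata_to_mmcif` with a toy `old_cif` mapping that contains only the four categories the function actually reads (real callers pass a parsed mmCIF):
```python
from alphafold.common import mmcif_metadata

# Toy input with just the categories add_metadata_to_mmcif reads.
old_cif = {
    '_struct_asym.id': ['A'],
    '_struct_asym.entity_id': ['1'],
    '_atom_site.B_iso_or_equiv': ['91.5', '88.0', '76.25'],
    '_atom_site.type_symbol': ['C', 'N', 'C'],
}
cif = mmcif_metadata.add_metadata_to_mmcif(old_cif, model_type='Monomer')
print(cif['_ma_qa_metric_global.metric_value'])  # mean pLDDT -> ['85.25']
print(cif['_atom_type.symbol'])                  # sorted unique symbols -> ['C', 'N']
```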
# Copyright 2021 DeepMind Technologies Limited
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
"""Tests for protein."""
import os
from absl.testing import absltest
from absl.testing import parameterized
from alphafold.common import protein
from alphafold.common import residue_constants
import numpy as np
# Internal import (7716).
TEST_DATA_DIR = 'alphafold/common/testdata/'
class ProteinTest(parameterized.TestCase):
def _check_shapes(self, prot, num_res):
"""Check that the processed shapes are correct."""
num_atoms = residue_constants.atom_type_num
self.assertEqual((num_res, num_atoms, 3), prot.atom_positions.shape)
self.assertEqual((num_res,), prot.aatype.shape)
self.assertEqual((num_res, num_atoms), prot.atom_mask.shape)
self.assertEqual((num_res,), prot.residue_index.shape)
self.assertEqual((num_res,), prot.chain_index.shape)
self.assertEqual((num_res, num_atoms), prot.b_factors.shape)
@parameterized.named_parameters(
dict(testcase_name='chain_A',
pdb_file='2rbg.pdb', chain_id='A', num_res=282, num_chains=1),
dict(testcase_name='chain_B',
pdb_file='2rbg.pdb', chain_id='B', num_res=282, num_chains=1),
dict(testcase_name='multichain',
pdb_file='2rbg.pdb', chain_id=None, num_res=564, num_chains=2))
def test_from_pdb_str(self, pdb_file, chain_id, num_res, num_chains):
pdb_file = os.path.join(absltest.get_default_test_srcdir(), TEST_DATA_DIR,
pdb_file)
with open(pdb_file) as f:
pdb_string = f.read()
prot = protein.from_pdb_string(pdb_string, chain_id)
self._check_shapes(prot, num_res)
self.assertGreaterEqual(prot.aatype.min(), 0)
# Allow equal since unknown restypes have index equal to restype_num.
self.assertLessEqual(prot.aatype.max(), residue_constants.restype_num)
self.assertLen(np.unique(prot.chain_index), num_chains)
def test_to_pdb(self):
with open(
os.path.join(absltest.get_default_test_srcdir(), TEST_DATA_DIR,
'2rbg.pdb')) as f:
pdb_string = f.read()
prot = protein.from_pdb_string(pdb_string)
pdb_string_reconstr = protein.to_pdb(prot)
for line in pdb_string_reconstr.splitlines():
self.assertLen(line, 80)
prot_reconstr = protein.from_pdb_string(pdb_string_reconstr)
np.testing.assert_array_equal(prot_reconstr.aatype, prot.aatype)
np.testing.assert_array_almost_equal(
prot_reconstr.atom_positions, prot.atom_positions)
np.testing.assert_array_almost_equal(
prot_reconstr.atom_mask, prot.atom_mask)
np.testing.assert_array_equal(
prot_reconstr.residue_index, prot.residue_index)
np.testing.assert_array_equal(
prot_reconstr.chain_index, prot.chain_index)
np.testing.assert_array_almost_equal(
prot_reconstr.b_factors, prot.b_factors)
@parameterized.named_parameters(
dict(
testcase_name='glucagon',
pdb_file='glucagon.pdb',
model_type='Monomer',
),
dict(testcase_name='7bui', pdb_file='5nmu.pdb', model_type='Multimer'),
)
def test_to_mmcif(self, pdb_file, model_type):
with open(
os.path.join(
absltest.get_default_test_srcdir(), TEST_DATA_DIR, pdb_file
)
) as f:
pdb_string = f.read()
prot = protein.from_pdb_string(pdb_string)
file_id = 'test'
mmcif_string = protein.to_mmcif(prot, file_id, model_type)
prot_reconstr = protein.from_mmcif_string(mmcif_string)
np.testing.assert_array_equal(prot_reconstr.aatype, prot.aatype)
np.testing.assert_array_almost_equal(
prot_reconstr.atom_positions, prot.atom_positions
)
np.testing.assert_array_almost_equal(
prot_reconstr.atom_mask, prot.atom_mask
)
np.testing.assert_array_equal(
prot_reconstr.residue_index, prot.residue_index
)
np.testing.assert_array_equal(prot_reconstr.chain_index, prot.chain_index)
np.testing.assert_array_almost_equal(
prot_reconstr.b_factors, prot.b_factors
)
def test_ideal_atom_mask(self):
with open(
os.path.join(
absltest.get_default_test_srcdir(), TEST_DATA_DIR, '2rbg.pdb'
)
) as f:
pdb_string = f.read()
prot = protein.from_pdb_string(pdb_string)
ideal_mask = protein.ideal_atom_mask(prot)
non_ideal_residues = set([102] + list(range(127, 286)))
for i, (res, atom_mask) in enumerate(
zip(prot.residue_index, prot.atom_mask)
):
if res in non_ideal_residues:
self.assertFalse(np.all(atom_mask == ideal_mask[i]), msg=f'{res}')
else:
self.assertTrue(np.all(atom_mask == ideal_mask[i]), msg=f'{res}')
def test_too_many_chains(self):
num_res = protein.PDB_MAX_CHAINS + 1
num_atom_type = residue_constants.atom_type_num
with self.assertRaises(ValueError):
_ = protein.Protein(
atom_positions=np.random.random([num_res, num_atom_type, 3]),
aatype=np.random.randint(0, 21, [num_res]),
atom_mask=np.random.randint(0, 2, [num_res]).astype(np.float32),
residue_index=np.arange(1, num_res+1),
chain_index=np.arange(num_res),
b_factors=np.random.uniform(1, 100, [num_res]))
if __name__ == '__main__':
absltest.main()
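The PDB round trip exercised by test_to_pdb above doubles as minimal usage documentation for this module. A short sketch follows, assuming the alphafold package is importable; 'input.pdb' is a placeholder for any local PDB file:

# Minimal round-trip sketch for the protein module (not part of the tests).
from alphafold.common import protein

with open('input.pdb') as f:
  pdb_string = f.read()

# Parses all chains; pass e.g. chain_id='A' to keep a single chain instead.
prot = protein.from_pdb_string(pdb_string)

# Arrays are residue-indexed: atom_positions is (num_res, 37, 3), one slot per
# canonical atom type, with atom_mask marking which slots are actually present.
print(prot.aatype.shape, prot.atom_positions.shape, prot.atom_mask.shape)

# Serialise back to PDB text; to_pdb pads every line to 80 characters.
pdb_out = protein.to_pdb(prot)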
# Copyright 2021 DeepMind Technologies Limited
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
"""Test that residue_constants generates correct values."""
from absl.testing import absltest
from absl.testing import parameterized
from alphafold.common import residue_constants
import numpy as np
class ResidueConstantsTest(parameterized.TestCase):
@parameterized.parameters(
('ALA', 0),
('CYS', 1),
('HIS', 2),
('MET', 3),
('LYS', 4),
('ARG', 4),
)
def testChiAnglesAtoms(self, residue_name, chi_num):
chi_angles_atoms = residue_constants.chi_angles_atoms[residue_name]
self.assertLen(chi_angles_atoms, chi_num)
for chi_angle_atoms in chi_angles_atoms:
self.assertLen(chi_angle_atoms, 4)
def testChiGroupsForAtom(self):
for k, chi_groups in residue_constants.chi_groups_for_atom.items():
res_name, atom_name = k
for chi_group_i, atom_i in chi_groups:
self.assertEqual(
atom_name,
residue_constants.chi_angles_atoms[res_name][chi_group_i][atom_i])
@parameterized.parameters(
('ALA', 5), ('ARG', 11), ('ASN', 8), ('ASP', 8), ('CYS', 6), ('GLN', 9),
('GLU', 9), ('GLY', 4), ('HIS', 10), ('ILE', 8), ('LEU', 8), ('LYS', 9),
('MET', 8), ('PHE', 11), ('PRO', 7), ('SER', 6), ('THR', 7), ('TRP', 14),
('TYR', 12), ('VAL', 7)
)
def testResidueAtoms(self, atom_name, num_residue_atoms):
residue_atoms = residue_constants.residue_atoms[atom_name]
self.assertLen(residue_atoms, num_residue_atoms)
  def testStandardAtomMask(self):
    with self.subTest('Check shape'):
      self.assertEqual(residue_constants.STANDARD_ATOM_MASK.shape, (21, 37,))

    with self.subTest('Check values'):
      str_to_row = lambda s: [c == '1' for c in s]  # More clear/concise.
      np.testing.assert_array_equal(
          residue_constants.STANDARD_ATOM_MASK,
          np.array([
              # NB This was defined by c+p but looks sane.
              str_to_row('11111 '),  # ALA
              str_to_row('111111 1 1 11 1 '),  # ARG
              str_to_row('111111 11 '),  # ASP
              str_to_row('111111 11 '),  # ASN
              str_to_row('11111 1 '),  # CYS
              str_to_row('111111 1 11 '),  # GLU
              str_to_row('111111 1 11 '),  # GLN
              str_to_row('111 1 '),  # GLY
              str_to_row('111111 11 1 1 '),  # HIS
              str_to_row('11111 11 1 '),  # ILE
              str_to_row('111111 11 '),  # LEU
              str_to_row('111111 1 1 1 '),  # LYS
              str_to_row('111111 11 '),  # MET
              str_to_row('111111 11 11 1 '),  # PHE
              str_to_row('111111 1 '),  # PRO
              str_to_row('11111 1 '),  # SER
              str_to_row('11111 1 1 '),  # THR
              str_to_row('111111 11 11 1 1 11 '),  # TRP
              str_to_row('111111 11 11 11 '),  # TYR
              str_to_row('11111 11 '),  # VAL
              str_to_row(' '),  # UNK
          ]))

    with self.subTest('Check row totals'):
      # Check each row has the right number of atoms.
      for row, restype in enumerate(residue_constants.restypes):  # A, R, ...
        long_restype = residue_constants.restype_1to3[restype]  # ALA, ARG, ...
        atoms_names = residue_constants.residue_atoms[
            long_restype]  # ['C', 'CA', 'CB', 'N', 'O'], ...
        self.assertLen(atoms_names,
                       residue_constants.STANDARD_ATOM_MASK[row, :].sum(),
                       long_restype)
  def testAtomTypes(self):
    self.assertEqual(residue_constants.atom_type_num, 37)

    self.assertEqual(residue_constants.atom_types[0], 'N')
    self.assertEqual(residue_constants.atom_types[1], 'CA')
    self.assertEqual(residue_constants.atom_types[2], 'C')
    self.assertEqual(residue_constants.atom_types[3], 'CB')
    self.assertEqual(residue_constants.atom_types[4], 'O')

    self.assertEqual(residue_constants.atom_order['N'], 0)
    self.assertEqual(residue_constants.atom_order['CA'], 1)
    self.assertEqual(residue_constants.atom_order['C'], 2)
    self.assertEqual(residue_constants.atom_order['CB'], 3)
    self.assertEqual(residue_constants.atom_order['O'], 4)

  def testRestypes(self):
    three_letter_restypes = [
        residue_constants.restype_1to3[r] for r in residue_constants.restypes]
    for restype, exp_restype in zip(
        three_letter_restypes, sorted(residue_constants.restype_1to3.values())):
      self.assertEqual(restype, exp_restype)
    self.assertEqual(residue_constants.restype_num, 20)

  def testSequenceToOneHotHHBlits(self):
    one_hot = residue_constants.sequence_to_onehot(
        'ABCDEFGHIJKLMNOPQRSTUVWXYZ-', residue_constants.HHBLITS_AA_TO_ID)
    exp_one_hot = np.array(
        [[1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
         [0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
         [0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
         [0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
         [0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
         [0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
         [0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
         [0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
         [0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
         [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0],
         [0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
         [0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
         [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
         [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
         [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0],
         [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0],
         [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0],
         [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0],
         [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0],
         [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0],
         [0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
         [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0],
         [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0],
         [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0],
         [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0],
         [0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
         [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1]])
    np.testing.assert_array_equal(one_hot, exp_one_hot)

  def testSequenceToOneHotStandard(self):
    one_hot = residue_constants.sequence_to_onehot(
        'ARNDCQEGHILKMFPSTWYV', residue_constants.restype_order)
    np.testing.assert_array_equal(one_hot, np.eye(20))

  def testSequenceToOneHotUnknownMapping(self):
    seq = 'ABCDEFGHIJKLMNOPQRSTUVWXYZ'
    expected_out = np.zeros([26, 21])
    for row, position in enumerate(
        [0, 20, 4, 3, 6, 13, 7, 8, 9, 20, 11, 10, 12, 2, 20, 14, 5, 1, 15, 16,
         20, 19, 17, 20, 18, 20]):
      expected_out[row, position] = 1
    aa_types = residue_constants.sequence_to_onehot(
        sequence=seq,
        mapping=residue_constants.restype_order_with_x,
        map_unknown_to_x=True)
    self.assertTrue((aa_types == expected_out).all())

  @parameterized.named_parameters(
      ('lowercase', 'aaa'),  # Insertions in A3M.
      ('gaps', '---'),  # Gaps in A3M.
      ('dots', '...'),  # Gaps in A3M.
      ('metadata', '>TEST'),  # FASTA metadata line.
  )
  def testSequenceToOneHotUnknownMappingError(self, seq):
    with self.assertRaises(ValueError):
      residue_constants.sequence_to_onehot(
          sequence=seq,
          mapping=residue_constants.restype_order_with_x,
          map_unknown_to_x=True)


if __name__ == '__main__':
  absltest.main()
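For direct use, the one-hot helper tested above takes a residue-to-index mapping and can optionally fold unknown letters into 'X'. A minimal sketch (the sequence literal is illustrative only):

# Sketch: one-hot encode a sequence over the 21-letter alphabet (20 standard
# restypes plus X). With map_unknown_to_x=True, an unknown uppercase letter
# such as 'Z' maps to the X column (index 20).
from alphafold.common import residue_constants

one_hot = residue_constants.sequence_to_onehot(
    sequence='ACDEFZ',
    mapping=residue_constants.restype_order_with_x,
    map_unknown_to_x=True)
print(one_hot.shape)  # (6, 21)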
Bond Residue Mean StdDev
CA-CB ALA 1.520 0.021
N-CA ALA 1.459 0.020
CA-C ALA 1.525 0.026
C-O ALA 1.229 0.019
CA-CB ARG 1.535 0.022
CB-CG ARG 1.521 0.027
CG-CD ARG 1.515 0.025
CD-NE ARG 1.460 0.017
NE-CZ ARG 1.326 0.013
CZ-NH1 ARG 1.326 0.013
CZ-NH2 ARG 1.326 0.013
N-CA ARG 1.459 0.020
CA-C ARG 1.525 0.026
C-O ARG 1.229 0.019
CA-CB ASN 1.527 0.026
CB-CG ASN 1.506 0.023
CG-OD1 ASN 1.235 0.022
CG-ND2 ASN 1.324 0.025
N-CA ASN 1.459 0.020
CA-C ASN 1.525 0.026
C-O ASN 1.229 0.019
CA-CB ASP 1.535 0.022
CB-CG ASP 1.513 0.021
CG-OD1 ASP 1.249 0.023
CG-OD2 ASP 1.249 0.023
N-CA ASP 1.459 0.020
CA-C ASP 1.525 0.026
C-O ASP 1.229 0.019
CA-CB CYS 1.526 0.013
CB-SG CYS 1.812 0.016
N-CA CYS 1.459 0.020
CA-C CYS 1.525 0.026
C-O CYS 1.229 0.019
CA-CB GLU 1.535 0.022
CB-CG GLU 1.517 0.019
CG-CD GLU 1.515 0.015
CD-OE1 GLU 1.252 0.011
CD-OE2 GLU 1.252 0.011
N-CA GLU 1.459 0.020
CA-C GLU 1.525 0.026
C-O GLU 1.229 0.019
CA-CB GLN 1.535 0.022
CB-CG GLN 1.521 0.027
CG-CD GLN 1.506 0.023
CD-OE1 GLN 1.235 0.022
CD-NE2 GLN 1.324 0.025
N-CA GLN 1.459 0.020
CA-C GLN 1.525 0.026
C-O GLN 1.229 0.019
N-CA GLY 1.456 0.015
CA-C GLY 1.514 0.016
C-O GLY 1.232 0.016
CA-CB HIS 1.535 0.022
CB-CG HIS 1.492 0.016
CG-ND1 HIS 1.369 0.015
CG-CD2 HIS 1.353 0.017
ND1-CE1 HIS 1.343 0.025
CD2-NE2 HIS 1.415 0.021
CE1-NE2 HIS 1.322 0.023
N-CA HIS 1.459 0.020
CA-C HIS 1.525 0.026
C-O HIS 1.229 0.019
CA-CB ILE 1.544 0.023
CB-CG1 ILE 1.536 0.028
CB-CG2 ILE 1.524 0.031
CG1-CD1 ILE 1.500 0.069
N-CA ILE 1.459 0.020
CA-C ILE 1.525 0.026
C-O ILE 1.229 0.019
CA-CB LEU 1.533 0.023
CB-CG LEU 1.521 0.029
CG-CD1 LEU 1.514 0.037
CG-CD2 LEU 1.514 0.037
N-CA LEU 1.459 0.020
CA-C LEU 1.525 0.026
C-O LEU 1.229 0.019
CA-CB LYS 1.535 0.022
CB-CG LYS 1.521 0.027
CG-CD LYS 1.520 0.034
CD-CE LYS 1.508 0.025
CE-NZ LYS 1.486 0.025
N-CA LYS 1.459 0.020
CA-C LYS 1.525 0.026
C-O LYS 1.229 0.019
CA-CB MET 1.535 0.022
CB-CG MET 1.509 0.032
CG-SD MET 1.807 0.026
SD-CE MET 1.774 0.056
N-CA MET 1.459 0.020
CA-C MET 1.525 0.026
C-O MET 1.229 0.019
CA-CB PHE 1.535 0.022
CB-CG PHE 1.509 0.017
CG-CD1 PHE 1.383 0.015
CG-CD2 PHE 1.383 0.015
CD1-CE1 PHE 1.388 0.020
CD2-CE2 PHE 1.388 0.020
CE1-CZ PHE 1.369 0.019
CE2-CZ PHE 1.369 0.019
N-CA PHE 1.459 0.020
CA-C PHE 1.525 0.026
C-O PHE 1.229 0.019
CA-CB PRO 1.531 0.020
CB-CG PRO 1.495 0.050
CG-CD PRO 1.502 0.033
CD-N PRO 1.474 0.014
N-CA PRO 1.468 0.017
CA-C PRO 1.524 0.020
C-O PRO 1.228 0.020
CA-CB SER 1.525 0.015
CB-OG SER 1.418 0.013
N-CA SER 1.459 0.020
CA-C SER 1.525 0.026
C-O SER 1.229 0.019
CA-CB THR 1.529 0.026
CB-OG1 THR 1.428 0.020
CB-CG2 THR 1.519 0.033
N-CA THR 1.459 0.020
CA-C THR 1.525 0.026
C-O THR 1.229 0.019
CA-CB TRP 1.535 0.022
CB-CG TRP 1.498 0.018
CG-CD1 TRP 1.363 0.014
CG-CD2 TRP 1.432 0.017
CD1-NE1 TRP 1.375 0.017
NE1-CE2 TRP 1.371 0.013
CD2-CE2 TRP 1.409 0.012
CD2-CE3 TRP 1.399 0.015
CE2-CZ2 TRP 1.393 0.017
CE3-CZ3 TRP 1.380 0.017
CZ2-CH2 TRP 1.369 0.019
CZ3-CH2 TRP 1.396 0.016
N-CA TRP 1.459 0.020
CA-C TRP 1.525 0.026
C-O TRP 1.229 0.019
CA-CB TYR 1.535 0.022
CB-CG TYR 1.512 0.015
CG-CD1 TYR 1.387 0.013
CG-CD2 TYR 1.387 0.013
CD1-CE1 TYR 1.389 0.015
CD2-CE2 TYR 1.389 0.015
CE1-CZ TYR 1.381 0.013
CE2-CZ TYR 1.381 0.013
CZ-OH TYR 1.374 0.017
N-CA TYR 1.459 0.020
CA-C TYR 1.525 0.026
C-O TYR 1.229 0.019
CA-CB VAL 1.543 0.021
CB-CG1 VAL 1.524 0.021
CB-CG2 VAL 1.524 0.021
N-CA VAL 1.459 0.020
CA-C VAL 1.525 0.026
C-O VAL 1.229 0.019
-
Angle Residue Mean StdDev
N-CA-CB ALA 110.1 1.4
CB-CA-C ALA 110.1 1.5
N-CA-C ALA 111.0 2.7
CA-C-O ALA 120.1 2.1
N-CA-CB ARG 110.6 1.8
CB-CA-C ARG 110.4 2.0
CA-CB-CG ARG 113.4 2.2
CB-CG-CD ARG 111.6 2.6
CG-CD-NE ARG 111.8 2.1
CD-NE-CZ ARG 123.6 1.4
NE-CZ-NH1 ARG 120.3 0.5
NE-CZ-NH2 ARG 120.3 0.5
NH1-CZ-NH2 ARG 119.4 1.1
N-CA-C ARG 111.0 2.7
CA-C-O ARG 120.1 2.1
N-CA-CB ASN 110.6 1.8
CB-CA-C ASN 110.4 2.0
CA-CB-CG ASN 113.4 2.2
CB-CG-ND2 ASN 116.7 2.4
CB-CG-OD1 ASN 121.6 2.0
ND2-CG-OD1 ASN 121.9 2.3
N-CA-C ASN 111.0 2.7
CA-C-O ASN 120.1 2.1
N-CA-CB ASP 110.6 1.8
CB-CA-C ASP 110.4 2.0
CA-CB-CG ASP 113.4 2.2
CB-CG-OD1 ASP 118.3 0.9
CB-CG-OD2 ASP 118.3 0.9
OD1-CG-OD2 ASP 123.3 1.9
N-CA-C ASP 111.0 2.7
CA-C-O ASP 120.1 2.1
N-CA-CB CYS 110.8 1.5
CB-CA-C CYS 111.5 1.2
CA-CB-SG CYS 114.2 1.1
N-CA-C CYS 111.0 2.7
CA-C-O CYS 120.1 2.1
N-CA-CB GLU 110.6 1.8
CB-CA-C GLU 110.4 2.0
CA-CB-CG GLU 113.4 2.2
CB-CG-CD GLU 114.2 2.7
CG-CD-OE1 GLU 118.3 2.0
CG-CD-OE2 GLU 118.3 2.0
OE1-CD-OE2 GLU 123.3 1.2
N-CA-C GLU 111.0 2.7
CA-C-O GLU 120.1 2.1
N-CA-CB GLN 110.6 1.8
CB-CA-C GLN 110.4 2.0
CA-CB-CG GLN 113.4 2.2
CB-CG-CD GLN 111.6 2.6
CG-CD-OE1 GLN 121.6 2.0
CG-CD-NE2 GLN 116.7 2.4
OE1-CD-NE2 GLN 121.9 2.3
N-CA-C GLN 111.0 2.7
CA-C-O GLN 120.1 2.1
N-CA-C GLY 113.1 2.5
CA-C-O GLY 120.6 1.8
N-CA-CB HIS 110.6 1.8
CB-CA-C HIS 110.4 2.0
CA-CB-CG HIS 113.6 1.7
CB-CG-ND1 HIS 123.2 2.5
CB-CG-CD2 HIS 130.8 3.1
CG-ND1-CE1 HIS 108.2 1.4
ND1-CE1-NE2 HIS 109.9 2.2
CE1-NE2-CD2 HIS 106.6 2.5
NE2-CD2-CG HIS 109.2 1.9
CD2-CG-ND1 HIS 106.0 1.4
N-CA-C HIS 111.0 2.7
CA-C-O HIS 120.1 2.1
N-CA-CB ILE 110.8 2.3
CB-CA-C ILE 111.6 2.0
CA-CB-CG1 ILE 111.0 1.9
CB-CG1-CD1 ILE 113.9 2.8
CA-CB-CG2 ILE 110.9 2.0
CG1-CB-CG2 ILE 111.4 2.2
N-CA-C ILE 111.0 2.7
CA-C-O ILE 120.1 2.1
N-CA-CB LEU 110.4 2.0
CB-CA-C LEU 110.2 1.9
CA-CB-CG LEU 115.3 2.3
CB-CG-CD1 LEU 111.0 1.7
CB-CG-CD2 LEU 111.0 1.7
CD1-CG-CD2 LEU 110.5 3.0
N-CA-C LEU 111.0 2.7
CA-C-O LEU 120.1 2.1
N-CA-CB LYS 110.6 1.8
CB-CA-C LYS 110.4 2.0
CA-CB-CG LYS 113.4 2.2
CB-CG-CD LYS 111.6 2.6
CG-CD-CE LYS 111.9 3.0
CD-CE-NZ LYS 111.7 2.3
N-CA-C LYS 111.0 2.7
CA-C-O LYS 120.1 2.1
N-CA-CB MET 110.6 1.8
CB-CA-C MET 110.4 2.0
CA-CB-CG MET 113.3 1.7
CB-CG-SD MET 112.4 3.0
CG-SD-CE MET 100.2 1.6
N-CA-C MET 111.0 2.7
CA-C-O MET 120.1 2.1
N-CA-CB PHE 110.6 1.8
CB-CA-C PHE 110.4 2.0
CA-CB-CG PHE 113.9 2.4
CB-CG-CD1 PHE 120.8 0.7
CB-CG-CD2 PHE 120.8 0.7
CD1-CG-CD2 PHE 118.3 1.3
CG-CD1-CE1 PHE 120.8 1.1
CG-CD2-CE2 PHE 120.8 1.1
CD1-CE1-CZ PHE 120.1 1.2
CD2-CE2-CZ PHE 120.1 1.2
CE1-CZ-CE2 PHE 120.0 1.8
N-CA-C PHE 111.0 2.7
CA-C-O PHE 120.1 2.1
N-CA-CB PRO 103.3 1.2
CB-CA-C PRO 111.7 2.1
CA-CB-CG PRO 104.8 1.9
CB-CG-CD PRO 106.5 3.9
CG-CD-N PRO 103.2 1.5
CA-N-CD PRO 111.7 1.4
N-CA-C PRO 112.1 2.6
CA-C-O PRO 120.2 2.4
N-CA-CB SER 110.5 1.5
CB-CA-C SER 110.1 1.9
CA-CB-OG SER 111.2 2.7
N-CA-C SER 111.0 2.7
CA-C-O SER 120.1 2.1
N-CA-CB THR 110.3 1.9
CB-CA-C THR 111.6 2.7
CA-CB-OG1 THR 109.0 2.1
CA-CB-CG2 THR 112.4 1.4
OG1-CB-CG2 THR 110.0 2.3
N-CA-C THR 111.0 2.7
CA-C-O THR 120.1 2.1
N-CA-CB TRP 110.6 1.8
CB-CA-C TRP 110.4 2.0
CA-CB-CG TRP 113.7 1.9
CB-CG-CD1 TRP 127.0 1.3
CB-CG-CD2 TRP 126.6 1.3
CD1-CG-CD2 TRP 106.3 0.8
CG-CD1-NE1 TRP 110.1 1.0
CD1-NE1-CE2 TRP 109.0 0.9
NE1-CE2-CD2 TRP 107.3 1.0
CE2-CD2-CG TRP 107.3 0.8
CG-CD2-CE3 TRP 133.9 0.9
NE1-CE2-CZ2 TRP 130.4 1.1
CE3-CD2-CE2 TRP 118.7 1.2
CD2-CE2-CZ2 TRP 122.3 1.2
CE2-CZ2-CH2 TRP 117.4 1.0
CZ2-CH2-CZ3 TRP 121.6 1.2
CH2-CZ3-CE3 TRP 121.2 1.1
CZ3-CE3-CD2 TRP 118.8 1.3
N-CA-C TRP 111.0 2.7
CA-C-O TRP 120.1 2.1
N-CA-CB TYR 110.6 1.8
CB-CA-C TYR 110.4 2.0
CA-CB-CG TYR 113.4 1.9
CB-CG-CD1 TYR 121.0 0.6
CB-CG-CD2 TYR 121.0 0.6
CD1-CG-CD2 TYR 117.9 1.1
CG-CD1-CE1 TYR 121.3 0.8
CG-CD2-CE2 TYR 121.3 0.8
CD1-CE1-CZ TYR 119.8 0.9
CD2-CE2-CZ TYR 119.8 0.9
CE1-CZ-CE2 TYR 119.8 1.6
CE1-CZ-OH TYR 120.1 2.7
CE2-CZ-OH TYR 120.1 2.7
N-CA-C TYR 111.0 2.7
CA-C-O TYR 120.1 2.1
N-CA-CB VAL 111.5 2.2
CB-CA-C VAL 111.4 1.9
CA-CB-CG1 VAL 110.9 1.5
CA-CB-CG2 VAL 110.9 1.5
CG1-CB-CG2 VAL 110.9 1.6
N-CA-C VAL 111.0 2.7
CA-C-O VAL 120.1 2.1
-
Non-bonded distance Minimum Dist Tolerance
C-C 3.4 1.5
C-N 3.25 1.5
C-S 3.5 1.5
C-O 3.22 1.5
N-N 3.1 1.5
N-S 3.35 1.5
N-O 3.07 1.5
O-S 3.32 1.5
O-O 3.04 1.5
S-S 2.03 1.0
-
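For orientation: the three whitespace-delimited blocks above tabulate mean bond lengths (in Å), bond angles (in degrees), and minimum non-bonded atom-pair distances (in Å), each closed by a lone `-` line; the repository loads them via residue_constants.load_stereo_chemical_props(). A standalone reader sketch, assuming the table is saved locally as stereo_chemical_props.txt (the path is a placeholder):

# Sketch: split the stereochemistry table into its three blocks.
def parse_stereo_props(path='stereo_chemical_props.txt'):
  """Return (bond, angle, non-bonded) row lists from the table above."""
  blocks, current = [], []
  with open(path) as f:
    for line in f:
      line = line.strip()
      if line == '-':  # A lone '-' closes the current block.
        if current:
          blocks.append(current)
        current = []
      elif line:
        current.append(line.split())
  if current:
    blocks.append(current)
  return [rows[1:] for rows in blocks]  # Drop each block's header row.

bonds, angles, non_bonded = parse_stereo_props()
bond, residue, mean, stddev = bonds[0]
print(bond, residue, float(mean), float(stddev))  # CA-CB ALA 1.52 0.021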