Merge branch 'dtk24.04.1'

15cd3506 · mashun1 · 24e633dc · 19085464 · 15cd3506 · 15cd3506
Commit 15cd3506 authored Aug 29, 2024 by mashun1
20 changed files
--- a/.gitignore
+++ b/.gitignore
+*.egg*
+tryme.ipynb
+build/
+dist/
+test/
+temp_output/
+temp_fasta/
+*pycache*
--- a/Dockerfile
+++ b/Dockerfile
+FROM image.sourcefind.cn:5000/dcu/admin/base/jax:0.4.23-ubuntu20.04-dtk24.04.1-py3.10
+# RUN apt update
+# WORKDIR /app
+# WORKDIR /app/softwares
+# RUN git clone https://github.com/soedinglab/hh-suite.git
+# RUN mkdir -p hh-suite/build && cd hh-suite/build && cmake -DCMAKE_INSTALL_PREFIX=. .. && make -j 4 && make install
+# ENV PATH=/app/softwares/hh-suite/build/bin:/app/softwares/hh-suite/build/scripts:$PATH
+# WORKDIR /app/softwares
+# RUN wget https://github.com/TimoLassmann/kalign/archive/refs/tags/v3.4.0.zip && unzip v3.4.0.zip && cd kalign-3.4.0 && mkdir build && cd build && cmake .. && make && make install
+# WORKDIR /app/softwares
+# RUN sudo apt install doxygen -y
+# RUN wget https://github.com/openmm/openmm/archive/refs/tags/8.0.0.zip && unzip 8.0.0.zip && cd openmm-8.0.0 && mkdir build && cd build && cmake .. && make && sudo make install && sudo make PythonInstall
+# WORKDIR /app/softwares
+# RUN wget https://github.com/openmm/pdbfixer/archive/refs/tags/1.9.zip && unzip 1.9.zip && cd pdbfixer-1.9 && python setup.py install 
+# RUN sudo apt install hmmer -y
+# WORKDIR /app
+# COPY . /app/alphafold2
+# RUN ls
+# RUN pip install --no-cache-dir -r /app/alphafold2/requirements_dcu.txt -i https://mirrors.ustc.edu.cn/pypi/web/simple
+# RUN pip install dm-haiku==0.0.11 flax==0.7.1 jmp==0.0.2 tabulate==0.8.9 --no-deps jax -i https://mirrors.ustc.edu.cn/pypi/web/simple
+# RUN pip install orbax==0.1.6 orbax-checkpoint==0.1.6 optax==0.2.2 -i https://mirrors.ustc.edu.cn/pypi/web/simple
+# WORKDIR /app/alphafold2
+# RUN python setup.py install
--- a/README.md
+++ b/README.md
-<!--
- * @Author: zhuww
- * @email: zhuww@sugon.com
- * @Date: 2023-04-06 18:04:07
- * @LastEditTime: 2023-12-26 15:54:01
-->
 # AF2
 ## Paper
 - [https://www.nature.com/articles/s41586-021-03819-2](https://www.nature.com/articles/s41586-021-03819-2)
@@ -19,9 +14,17 @@ AlphaFold2通过从蛋白质序列和结构数据中提取信息，使用神经
 ![img](./docs/alphafold2_1.png)
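-<!-- ## Environment Setup
+## Environment Setup
+### Docker (Method 1)
+    # With this method you do not need to download this repository: the image already contains runnable code, but you must mount the required data files
-### Docker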
+    docker pull image.sourcefind.cn:5000/dcu/admin/base/custom:alphafold2-dtk24.04.1-py310
+    docker run --shm-size 100g --network=host --name=alphafold2 --privileged --device=/dev/kfd --device=/dev/dri --group-add video --cap-add=SYS_PTRACE --security-opt seccomp=unconfined -v <local data path>:<container data path> -v /opt/hyhal:/opt/hyhal:ro -it <your IMAGE ID> bash
+### Docker (Method 2)
    docker pull image.sourcefind.cn:5000/dcu/admin/base/jax:0.4.23-ubuntu20.04-dtk24.04.1-py3.10
@@ -45,7 +48,7 @@ AlphaFold2通过从蛋白质序列和结构数据中提取信息，使用神经
    export PATH="$(pwd)/bin:$(pwd)/scripts:$PATH"
    wget https://github.com/TimoLassmann/kalign/archive/refs/tags/v3.4.0.zip
-    unzip 3.4.0.zip && cd kalign-3.4.0
+    unzip v3.4.0.zip && cd kalign-3.4.0
    mkdir build 
    cd build
    cmake .. 
@@ -65,23 +68,8 @@ AlphaFold2通过从蛋白质序列和结构数据中提取信息，使用神经
    wget https://github.com/openmm/pdbfixer/archive/refs/tags/1.9.zip
-    unzip 1.9.zip && cd pdbfixer-1.9 && python setup.py install  -->
+    unzip 1.9.zip && cd pdbfixer-1.9 && python setup.py install 
-## Environment Setup
-A Docker image for inference can be pulled from [光源](https://www.sourcefind.cn/#/image/dcu/custom):
-```
-docker pull image.sourcefind.cn:5000/dcu/admin/base/custom:alphafold2-2.3.2-dtk23.10-py38
-# Replace <Image ID> with the ID of the Docker image pulled above
-# <Host Path>: path on the host
-# <Container Path>: mapped path inside the container
-docker run -it --name alphafold --privileged --shm-size=32G  --device=/dev/kfd --device=/dev/dri/ --cap-add=SYS_PTRACE --security-opt seccomp=unconfined --ulimit memlock=-1:-1 --ipc=host --network host --group-add video -v <Host Path>:<Container Path> <Image ID> /bin/bash
-```
-Image version dependencies:
-* DTK driver: dtk23.10
-* Jax: 0.3.25
-* TensorFlow2: 2.11.0
-* python: python3.8
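 ## Dataset
 We recommend the open-source datasets used by AlphaFold2, including BFD, MGnify, PDB70, Uniclust, and Uniref90; together they total about 2.62 TB. The dataset layout is as follows: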
@@ -171,12 +159,12 @@ $DOWNLOAD_DIR/
 ```
 [View the protein 3D structure](https://www.pdbus.org/3d-view)
-<div style="display: flex; justify-content: center; align-items: center;">
-  <img src="./docs/result_pdb.png" alt="Image">
+ID: 8U23
-  <div style="position: absolute; top: 50%; left: 50%; transform: translate(-50%, -50%); background: rgba(0, 0, 0, 0.5); color: #fff; padding: 10px;">
-    Red is the true structure; blue is the predicted structure.
+Blue is the predicted structure; yellow is the true structure.
-  </div>
-</div>
+![alt text](image.png)
 ### Accuracy
 Test data: [casp15](https://www.predictioncenter.org/casp15/targetlist.cgi), [uniprot](https://www.uniprot.org/),
@@ -196,6 +184,8 @@ $DOWNLOAD_DIR/
 | fp32 | monomer | T1024 | 408 | 0.664 | 0.470 | 87.076 | 0.829 | 0.518 | 3.516 |
 | fp32 | multimer | H1106 | 236 | 0.203 | 0.144 | 0.860 | 0.181 | 0.151 | 20.457 |
 ## Application Scenarios
 ### Algorithm Category

--- a/README_official.md
+++ b/README_official.md
+![header](imgs/header.jpg)
+# AlphaFold
+This package provides an implementation of the inference pipeline of AlphaFold
+v2. For simplicity, we refer to this model as AlphaFold throughout the rest of
+this document.
+We also provide:
+1.  An implementation of AlphaFold-Multimer. This represents a work in progress
+    and AlphaFold-Multimer isn't expected to be as stable as our monomer
+    AlphaFold system. [Read the guide](#updating-existing-installation) for how
+    to upgrade and update code.
+2.  The [technical note](docs/technical_note_v2.3.0.md) containing the models
+    and inference procedure for an updated AlphaFold v2.3.0.
+3.  A [CASP15 baseline](docs/casp15_predictions.zip) set of predictions along
+    with documentation of any manual interventions performed.
+Any publication that discloses findings arising from using this source code or
+the model parameters should [cite](#citing-this-work) the
+[AlphaFold paper](https://doi.org/10.1038/s41586-021-03819-2) and, if
+applicable, the
+[AlphaFold-Multimer paper](https://www.biorxiv.org/content/10.1101/2021.10.04.463034v1).
+Please also refer to the
+[Supplementary Information](https://static-content.springer.com/esm/art%3A10.1038%2Fs41586-021-03819-2/MediaObjects/41586_2021_3819_MOESM1_ESM.pdf)
+for a detailed description of the method.
+**You can use a slightly simplified version of AlphaFold with
+[this Colab notebook](https://colab.research.google.com/github/deepmind/alphafold/blob/main/notebooks/AlphaFold.ipynb)**
+or community-supported versions (see below).
+If you have any questions, please contact the AlphaFold team at
+[alphafold@deepmind.com](mailto:alphafold@deepmind.com).
+![CASP14 predictions](imgs/casp14_predictions.gif)
+## Installation and running your first prediction
+You will need a machine running Linux; AlphaFold does not support other
+operating systems. Full installation requires up to 3 TB of disk space to keep
+genetic databases (SSD storage is recommended) and a modern NVIDIA GPU (GPUs
+with more memory can predict larger protein structures).
+Please follow these steps:
+1.  Install [Docker](https://www.docker.com/).
+    *   Install
+        [NVIDIA Container Toolkit](https://docs.nvidia.com/datacenter/cloud-native/container-toolkit/install-guide.html)
+        for GPU support.
+    *   Set up running
+        [Docker as a non-root user](https://docs.docker.com/engine/install/linux-postinstall/#manage-docker-as-a-non-root-user).
+1.  Clone this repository and `cd` into it.
+    ```bash
+    git clone https://github.com/deepmind/alphafold.git
+    cd ./alphafold
+    ```
+1.  Download genetic databases and model parameters:
+    *   Install `aria2c`. On most Linux distributions it is available via the
+    package manager as the `aria2` package (on Debian-based distributions this
+    can be installed by running `sudo apt install aria2`).
+    *   Please use the script `scripts/download_all_data.sh` to download
+    and set up full databases. This may take substantial time (download size is
+    556 GB), so we recommend running this script in the background:
+    ```bash
+    scripts/download_all_data.sh <DOWNLOAD_DIR> > download.log 2> download_all.log &
+    ```
+    *   **Note: The download directory `<DOWNLOAD_DIR>` should *not* be a
+    subdirectory in the AlphaFold repository directory.** If it is, the Docker
+    build will be slow as the large databases will be copied into the docker
+    build context.
+    *   It is possible to run AlphaFold with reduced databases; please refer to
+    the [complete documentation](#genetic-databases).
+1.  Check that AlphaFold will be able to use a GPU by running:
+    ```bash
+    docker run --rm --gpus all nvidia/cuda:11.0-base nvidia-smi
+    ```
+    The output of this command should show a list of your GPUs. If it doesn't,
+    check if you followed all steps correctly when setting up the
+    [NVIDIA Container Toolkit](https://docs.nvidia.com/datacenter/cloud-native/container-toolkit/install-guide.html)
+    or take a look at the following
+    [NVIDIA Docker issue](https://github.com/NVIDIA/nvidia-docker/issues/1447#issuecomment-801479573).
+    If you wish to run AlphaFold using Singularity (a common containerization
+    platform on HPC systems) we recommend using some of the third party Singularity
+    setups as linked in https://github.com/deepmind/alphafold/issues/10 or
+    https://github.com/deepmind/alphafold/issues/24.
+1.  Build the Docker image:
+    ```bash
+    docker build -f docker/Dockerfile -t alphafold .
+    ```
+    If you encounter the following error:
+    ```
+    W: GPG error: https://developer.download.nvidia.com/compute/cuda/repos/ubuntu1804/x86_64 InRelease: The following signatures couldn't be verified because the public key is not available: NO_PUBKEY A4B469963BF863CC
+    E: The repository 'https://developer.download.nvidia.com/compute/cuda/repos/ubuntu1804/x86_64 InRelease' is not signed.
+    ```
+    use the workaround described in
+    https://github.com/deepmind/alphafold/issues/463#issuecomment-1124881779.
+1.  Install the `run_docker.py` dependencies. Note: You may optionally wish to
+    create a
+    [Python Virtual Environment](https://docs.python.org/3/tutorial/venv.html)
+    to prevent conflicts with your system's Python environment.
+    ```bash
+    pip3 install -r docker/requirements.txt
+    ```
+1.  Make sure that the output directory exists (the default is `/tmp/alphafold`)
+    and that you have sufficient permissions to write into it.
+1.  Run `run_docker.py` pointing to a FASTA file containing the protein
+    sequence(s) for which you wish to predict the structure (`--fasta_paths`
+    parameter). AlphaFold will search for the available templates before the
+    date specified by the `--max_template_date` parameter; this could be used to
+    avoid certain templates during modeling. `--data_dir` is the directory with
+    downloaded genetic databases and `--output_dir` is the absolute path to the
+    output directory.
+    ```bash
+    python3 docker/run_docker.py \
+      --fasta_paths=your_protein.fasta \
+      --max_template_date=2022-01-01 \
+      --data_dir=$DOWNLOAD_DIR \
+      --output_dir=/home/user/absolute_path_to_the_output_dir
+    ```
+1.  Once the run is over, the output directory shall contain predicted
+    structures of the target protein. Please check the documentation below for
+    additional options and troubleshooting tips.
+### Genetic databases
+This step requires `aria2c` to be installed on your machine.
+AlphaFold needs multiple genetic (sequence) databases to run:
+*   [BFD](https://bfd.mmseqs.com/),
+*   [MGnify](https://www.ebi.ac.uk/metagenomics/),
+*   [PDB70](http://wwwuser.gwdg.de/~compbiol/data/hhsuite/databases/hhsuite_dbs/),
+*   [PDB](https://www.rcsb.org/) (structures in the mmCIF format),
+*   [PDB seqres](https://www.rcsb.org/) – only for AlphaFold-Multimer,
+*   [UniRef30 (FKA UniClust30)](https://uniclust.mmseqs.com/),
+*   [UniProt](https://www.uniprot.org/uniprot/) – only for AlphaFold-Multimer,
+*   [UniRef90](https://www.uniprot.org/help/uniref).
+We provide a script `scripts/download_all_data.sh` that can be used to download
+and set up all of these databases:
+*   Recommended default:
+    ```bash
+    scripts/download_all_data.sh <DOWNLOAD_DIR>
+    ```
+    will download the full databases.
+*   With `reduced_dbs` parameter:
+    ```bash
+    scripts/download_all_data.sh <DOWNLOAD_DIR> reduced_dbs
+    ```
+    will download a reduced version of the databases to be used with the
+    `reduced_dbs` database preset. This shall be used with the corresponding
+    AlphaFold parameter `--db_preset=reduced_dbs` later during the AlphaFold run
+    (please see [AlphaFold parameters](#running-alphafold) section).
+:ledger: **Note: The download directory `<DOWNLOAD_DIR>` should *not* be a
+subdirectory in the AlphaFold repository directory.** If it is, the Docker build
+will be slow as the large databases will be copied during the image creation.
+We don't provide exactly the database versions used in CASP14 – see the
+[note on reproducibility](#note-on-casp14-reproducibility). Some of the
+databases are mirrored for speed, see [mirrored databases](#mirrored-databases).
+:ledger: **Note: The total download size for the full databases is around 556 GB
+and the total size when unzipped is 2.62 TB. Please make sure you have a large
+enough hard drive space, bandwidth and time to download. We recommend using an
+SSD for better genetic search performance.**
+:ledger: **Note: If the download directory and datasets don't have full read and
+write permissions, it can cause errors with the MSA tools, with opaque
+(external) error messages. Please ensure the required permissions are applied,
+e.g. with the `sudo chmod 755 --recursive "$DOWNLOAD_DIR"` command.**
+The `download_all_data.sh` script will also download the model parameter files.
+Once the script has finished, you should have the following directory structure:
+```
+$DOWNLOAD_DIR/                             # Total: ~ 2.62 TB (download: 556 GB)
+    bfd/                                   # ~ 1.8 TB (download: 271.6 GB)
+        # 6 files.
+    mgnify/                                # ~ 120 GB (download: 67 GB)
+        mgy_clusters_2022_05.fa
+    params/                                # ~ 5.3 GB (download: 5.3 GB)
+        # 5 CASP14 models,
+        # 5 pTM models,
+        # 5 AlphaFold-Multimer models,
+        # LICENSE,
+        # = 16 files.
+    pdb70/                                 # ~ 56 GB (download: 19.5 GB)
+        # 9 files.
+    pdb_mmcif/                             # ~ 238 GB (download: 43 GB)
+        mmcif_files/
+            # About 199,000 .cif files.
+        obsolete.dat
+    pdb_seqres/                            # ~ 0.2 GB (download: 0.2 GB)
+        pdb_seqres.txt
+    small_bfd/                             # ~ 17 GB (download: 9.6 GB)
+        bfd-first_non_consensus_sequences.fasta
+    uniref30/                              # ~ 206 GB (download: 52.5 GB)
+        # 7 files.
+    uniprot/                               # ~ 105 GB (download: 53 GB)
+        uniprot.fasta
+    uniref90/                              # ~ 67 GB (download: 34 GB)
+        uniref90.fasta
+```
+`bfd/` is only downloaded if you download the full databases, and `small_bfd/`
+is only downloaded if you download the reduced databases.
+### Model parameters
+While the AlphaFold code is licensed under the Apache 2.0 License, the AlphaFold
+parameters and CASP15 prediction data are made available under the terms of the
+CC BY 4.0 license. Please see the [Disclaimer](#license-and-disclaimer) below
+for more detail.
+The AlphaFold parameters are available from
+https://storage.googleapis.com/alphafold/alphafold_params_2022-12-06.tar, and
+are downloaded as part of the `scripts/download_all_data.sh` script. This script
+will download parameters for:
+*   5 models which were used during CASP14, and were extensively validated for
+    structure prediction quality (see Jumper et al. 2021, Suppl. Methods 1.12
+    for details).
+*   5 pTM models, which were fine-tuned to produce pTM (predicted TM-score) and
+    PAE (predicted aligned error) values alongside their structure predictions
+    (see Jumper et al. 2021, Suppl. Methods 1.9.7 for details).
+*   5 AlphaFold-Multimer models that produce pTM and PAE values alongside their
+    structure predictions.
+### Updating existing installation
+If you have a previous version you can either reinstall fully from scratch
+(remove everything and run the setup from scratch) or you can do an incremental
+update that will be significantly faster but will require a bit more work. Make
+sure you follow these steps in the exact order they are listed below:
+1.  **Update the code.**
+    *   Go to the directory with the cloned AlphaFold repository and run `git
+        fetch origin main` to get all code updates.
+1.  **Update the UniProt, UniRef, MGnify and PDB seqres databases.**
+    *   Remove `<DOWNLOAD_DIR>/uniprot`.
+    *   Run `scripts/download_uniprot.sh <DOWNLOAD_DIR>`.
+    *   Remove `<DOWNLOAD_DIR>/uniclust30`.
+    *   Run `scripts/download_uniref30.sh <DOWNLOAD_DIR>`.
+    *   Remove `<DOWNLOAD_DIR>/uniref90`.
+    *   Run `scripts/download_uniref90.sh <DOWNLOAD_DIR>`.
+    *   Remove `<DOWNLOAD_DIR>/mgnify`.
+    *   Run `scripts/download_mgnify.sh <DOWNLOAD_DIR>`.
+    *   Remove `<DOWNLOAD_DIR>/pdb_mmcif`. PDB SeqRes and PDB must be from
+        exactly the same date; skipping this step can cause errors during
+        template search when running AlphaFold-Multimer.
+    *   Run `scripts/download_pdb_mmcif.sh <DOWNLOAD_DIR>`.
+    *   Run `scripts/download_pdb_seqres.sh <DOWNLOAD_DIR>`.
+1.  **Update the model parameters.**
+    *   Remove the old model parameters in `<DOWNLOAD_DIR>/params`.
+    *   Download new model parameters using
+        `scripts/download_alphafold_params.sh <DOWNLOAD_DIR>`.
+1.  **Follow [Running AlphaFold](#running-alphafold).**
+#### Using deprecated model weights
+To use the deprecated v2.2.0 AlphaFold-Multimer model weights:
+1.  Change `SOURCE_URL` in `scripts/download_alphafold_params.sh` to
+    `https://storage.googleapis.com/alphafold/alphafold_params_2022-03-02.tar`,
+    and download the old parameters.
+2.  Change the `_v3` to `_v2` in the multimer `MODEL_PRESETS` in `config.py`.
+To use the deprecated v2.1.0 AlphaFold-Multimer model weights:
+1.  Change `SOURCE_URL` in `scripts/download_alphafold_params.sh` to
+    `https://storage.googleapis.com/alphafold/alphafold_params_2022-01-19.tar`,
+    and download the old parameters.
+2.  Remove the `_v3` in the multimer `MODEL_PRESETS` in `config.py`.
+## Running AlphaFold
+**The simplest way to run AlphaFold is using the provided Docker script.** This
+was tested on Google Cloud with a machine using the `nvidia-gpu-cloud-image`
+with 12 vCPUs, 85 GB of RAM, a 100 GB boot disk, the databases on an additional
+3 TB disk, and an A100 GPU. For your first run, please follow the instructions
+from [Installation and running your first prediction](#installation-and-running-your-first-prediction)
+section.
+1.  By default, AlphaFold will attempt to use all visible GPU devices. To use a
+    subset, specify a comma-separated list of GPU UUID(s) or index(es) using the
+    `--gpu_devices` flag. See
+    [GPU enumeration](https://docs.nvidia.com/datacenter/cloud-native/container-toolkit/user-guide.html#gpu-enumeration)
+    for more details.
+1.  You can control which AlphaFold model to run by adding the `--model_preset=`
+    flag. We provide the following models:
+    *   **monomer**: This is the original model used at CASP14 with no
+        ensembling.
+    *   **monomer\_casp14**: This is the original model used at CASP14 with
+        `num_ensemble=8`, matching our CASP14 configuration. This is largely
+        provided for reproducibility as it is 8x more computationally expensive
+        for limited accuracy gain (+0.1 average GDT gain on CASP14 domains).
+    *   **monomer\_ptm**: This is the original CASP14 model fine tuned with the
+        pTM head, providing a pairwise confidence measure. It is slightly less
+        accurate than the normal monomer model.
+    *   **multimer**: This is the [AlphaFold-Multimer](#citing-this-work) model.
+        To use this model, provide a multi-sequence FASTA file. In addition, the
+        UniProt database should have been downloaded.
+1.  You can control MSA speed/quality tradeoff by adding
+    `--db_preset=reduced_dbs` or `--db_preset=full_dbs` to the run command. We
+    provide the following presets:
+    *   **reduced\_dbs**: This preset is optimized for speed and lower hardware
+        requirements. It runs with a reduced version of the BFD database. It
+        requires 8 CPU cores (vCPUs), 8 GB of RAM, and 600 GB of disk space.
+    *   **full\_dbs**: This runs with all genetic databases used at CASP14.
+    Running the command above with the `monomer` model preset and the
+    `reduced_dbs` data preset would look like this:
+    ```bash
+    python3 docker/run_docker.py \
+      --fasta_paths=T1050.fasta \
+      --max_template_date=2020-05-14 \
+      --model_preset=monomer \
+      --db_preset=reduced_dbs \
+      --data_dir=$DOWNLOAD_DIR \
+      --output_dir=/home/user/absolute_path_to_the_output_dir
+    ```
+1.  After generating the predicted model, AlphaFold runs a relaxation
+    step to improve local geometry. By default, only the best model (by
+    pLDDT) is relaxed (`--models_to_relax=best`). Alternatively, all of the
+    models (`--models_to_relax=all`) or none of them
+    (`--models_to_relax=none`) can be relaxed.
+1.  The relaxation step can be run on GPU (faster, but could be less stable) or
+    CPU (slow, but stable). This can be controlled with `--enable_gpu_relax=true`
+    (default) or `--enable_gpu_relax=false`.
+1.  AlphaFold can re-use MSAs (multiple sequence alignments) for the same
+    sequence via `--use_precomputed_msas=true` option; this can be useful for
+    trying different AlphaFold parameters. This option assumes that the
+    directory structure generated by the first AlphaFold run in the output
+    directory exists and that the protein sequence is the same.
+### Running AlphaFold-Multimer
+All steps are the same as when running the monomer system, but you will have to:
+*   provide an input fasta with multiple sequences, and
+*   set `--model_preset=multimer`.
+An example that folds a protein complex `multimer.fasta`:
+```bash
+python3 docker/run_docker.py \
+  --fasta_paths=multimer.fasta \
+  --max_template_date=2020-05-14 \
+  --model_preset=multimer \
+  --data_dir=$DOWNLOAD_DIR \
+  --output_dir=/home/user/absolute_path_to_the_output_dir
+```
+By default the multimer system will run 5 seeds per model (25 total
+predictions). For a small drop in accuracy you may wish to run a single seed
+per model. This can be done via the `--num_multimer_predictions_per_model`
+flag, e.g. `--num_multimer_predictions_per_model=1`.
+### AlphaFold prediction speed
+The table below reports prediction runtimes for proteins of various lengths. We
+only measure unrelaxed structure prediction with three recycles while
+excluding runtimes from MSA and template search. When running
+`docker/run_docker.py` with `--benchmark=true`, this runtime is stored in
+`timings.json`. All runtimes are from a single A100 NVIDIA GPU. Prediction
+speed on A100 for smaller structures can be improved by increasing
+`global_config.subbatch_size` in `alphafold/model/config.py`.
+No. residues | Prediction time (s)
+-----------: | ------------------:
+100          | 4.9
+200          | 7.7
+300          | 13
+400          | 18
+500          | 29
+600          | 36
+700          | 53
+800          | 60
+900          | 91
+1,000        | 96
+1,100        | 140
+1,500        | 280
+2,000        | 450
+2,500        | 969
+3,000        | 1,240
+3,500        | 2,465
+4,000        | 5,660
+4,500        | 12,475
+5,000        | 18,824
+### Examples
+Below are examples of how to use AlphaFold in different scenarios.
+#### Folding a monomer
+Say we have a monomer with the sequence `<SEQUENCE>`. The input fasta should be:
+```fasta
+>sequence_name
+<SEQUENCE>
+```
+Then run the following command:
+```bash
+python3 docker/run_docker.py \
+  --fasta_paths=monomer.fasta \
+  --max_template_date=2021-11-01 \
+  --model_preset=monomer \
+  --data_dir=$DOWNLOAD_DIR \
+  --output_dir=/home/user/absolute_path_to_the_output_dir
+```
+#### Folding a homomer
+Say we have a homomer with 3 copies of the same sequence `<SEQUENCE>`. The input
+fasta should be:
+```fasta
+>sequence_1
+<SEQUENCE>
+>sequence_2
+<SEQUENCE>
+>sequence_3
+<SEQUENCE>
+```
+Then run the following command:
+```bash
+python3 docker/run_docker.py \
+  --fasta_paths=homomer.fasta \
+  --max_template_date=2021-11-01 \
+  --model_preset=multimer \
+  --data_dir=$DOWNLOAD_DIR \
+  --output_dir=/home/user/absolute_path_to_the_output_dir
+```
+#### Folding a heteromer
+Say we have an A2B3 heteromer, i.e. with 2 copies of `<SEQUENCE A>` and 3 copies
+of `<SEQUENCE B>`. The input fasta should be:
+```fasta
+>sequence_1
+<SEQUENCE A>
+>sequence_2
+<SEQUENCE A>
+>sequence_3
+<SEQUENCE B>
+>sequence_4
+<SEQUENCE B>
+>sequence_5
+<SEQUENCE B>
+```
+Then run the following command:
+```bash
+python3 docker/run_docker.py \
+  --fasta_paths=heteromer.fasta \
+  --max_template_date=2021-11-01 \
+  --model_preset=multimer \
+  --data_dir=$DOWNLOAD_DIR \
+  --output_dir=/home/user/absolute_path_to_the_output_dir
+```
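The multimer FASTA layouts above follow one pattern: one record per chain copy, numbered in order. A minimal sketch of generating such a file is shown here; `build_multimer_fasta` and the toy sequences are hypothetical and not part of the AlphaFold code base.

```python
from typing import Iterable, Tuple


def build_multimer_fasta(chains: Iterable[Tuple[str, int]]) -> str:
    """Return FASTA text with one record per chain copy.

    `chains` is an iterable of (sequence, copies) pairs; records are
    named sequence_1, sequence_2, ... in the order they are emitted.
    """
    records = []
    index = 1
    for sequence, copies in chains:
        if copies < 1:
            raise ValueError("each chain needs at least one copy")
        for _ in range(copies):
            records.append(f">sequence_{index}\n{sequence}")
            index += 1
    return "\n".join(records) + "\n"


# A2B3 heteromer: 2 copies of chain A and 3 copies of chain B.
# The sequences are made-up placeholders, not real proteins.
fasta_text = build_multimer_fasta([("MKTAYIAK", 2), ("GSHMLEDP", 3)])
```

Writing `fasta_text` to `heteromer.fasta` would give a file of the shape expected by `--fasta_paths` in the command above.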
+#### Folding multiple monomers one after another
+Say we have two monomers, `monomer1.fasta` and `monomer2.fasta`.
+We can fold both sequentially by using the following command:
+```bash
+python3 docker/run_docker.py \
+  --fasta_paths=monomer1.fasta,monomer2.fasta \
+  --max_template_date=2021-11-01 \
+  --model_preset=monomer \
+  --data_dir=$DOWNLOAD_DIR \
+  --output_dir=/home/user/absolute_path_to_the_output_dir
+```
+#### Folding multiple multimers one after another
+Say we have two multimers, `multimer1.fasta` and `multimer2.fasta`.
+We can fold both sequentially by using the following command:
+```bash
+python3 docker/run_docker.py \
+  --fasta_paths=multimer1.fasta,multimer2.fasta \
+  --max_template_date=2021-11-01 \
+  --model_preset=multimer \
+  --data_dir=$DOWNLOAD_DIR \
+  --output_dir=/home/user/absolute_path_to_the_output_dir
+```
+### AlphaFold output
+The outputs will be saved in a subdirectory of the directory provided via the
+`--output_dir` flag of `run_docker.py` (defaults to `/tmp/alphafold/`). The
+outputs include the computed MSAs, unrelaxed structures, relaxed structures,
+ranked structures, raw model outputs, prediction metadata, and section timings.
+The `--output_dir` directory will have the following structure:
+```
+<target_name>/
+    features.pkl
+    ranked_{0,1,2,3,4}.pdb
+    ranking_debug.json
+    relax_metrics.json
+    relaxed_model_{1,2,3,4,5}.pdb
+    result_model_{1,2,3,4,5}.pkl
+    timings.json
+    unrelaxed_model_{1,2,3,4,5}.pdb
+    msas/
+        bfd_uniref_hits.a3m
+        mgnify_hits.sto
+        uniref90_hits.sto
+```
+The contents of each output file are as follows:
+*   `features.pkl` – A `pickle` file containing the input feature NumPy arrays
+    used by the models to produce the structures.
+*   `unrelaxed_model_*.pdb` – A PDB format text file containing the predicted
+    structure, exactly as outputted by the model.
+*   `relaxed_model_*.pdb` – A PDB format text file containing the predicted
+    structure, after performing an Amber relaxation procedure on the unrelaxed
+    structure prediction (see Jumper et al. 2021, Suppl. Methods 1.8.6 for
+    details).
+*   `ranked_*.pdb` – A PDB format text file containing the predicted structures,
+    after reordering by model confidence. Here `ranked_i.pdb` should contain
+    the prediction with the (`i + 1`)-th highest confidence (so that
+    `ranked_0.pdb` has the highest confidence). To rank model confidence, we use
+    predicted LDDT (pLDDT) scores (see Jumper et al. 2021, Suppl. Methods 1.9.6
+    for details). If `--models_to_relax=all` then all ranked structures are
+    relaxed. If `--models_to_relax=best` then only `ranked_0.pdb` is relaxed
+    (the rest are unrelaxed). If `--models_to_relax=none`, then the ranked
+    structures are all unrelaxed.
+*   `ranking_debug.json` – A JSON format text file containing the pLDDT values
+    used to perform the model ranking, and a mapping back to the original model
+    names.
+*   `relax_metrics.json` – A JSON format text file containing relax metrics, for
+    instance remaining violations.
+*   `timings.json` – A JSON format text file containing the times taken to run
+    each section of the AlphaFold pipeline.
+*   `msas/` – A directory containing the files describing the various genetic
+    tool hits that were used to construct the input MSA.
+*   `result_model_*.pkl` – A `pickle` file containing a nested dictionary of the
+    various NumPy arrays directly produced by the model. In addition to the
+    output of the structure module, this includes auxiliary outputs such as:
+    *   Distograms (`distogram/logits` contains a NumPy array of shape [N_res,
+        N_res, N_bins] and `distogram/bin_edges` contains the definition of the
+        bins).
+    *   Per-residue pLDDT scores (`plddt` contains a NumPy array of shape
+        [N_res] with the range of possible values from `0` to `100`, where `100`
+        means most confident). This can serve to identify sequence regions
+        predicted with high confidence or as an overall per-target confidence
+        score when averaged across residues.
+    *   Present only if using pTM models: predicted TM-score (`ptm` field
+        contains a scalar). As a predictor of a global superposition metric,
+        this score is designed to also assess whether the model is confident in
+        the overall domain packing.
+    *   Present only if using pTM models: predicted pairwise aligned errors
+        (`predicted_aligned_error` contains a NumPy array of shape [N_res,
+        N_res] with the range of possible values from `0` to
+        `max_predicted_aligned_error`, where `0` means most confident). This can
+        serve for a visualisation of domain packing confidence within the
+        structure.
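The relationship between per-residue pLDDT and the `ranked_*.pdb` ordering can be sketched as follows. This is an illustration only, assuming the per-target confidence is the plain mean of the per-residue scores; the function name and toy values are hypothetical, and the actual pipeline may compute its ranking differently (for example for multimer models).

```python
from statistics import mean
from typing import Dict, List, Sequence


def rank_by_mean_plddt(per_residue_plddt: Dict[str, Sequence[float]]) -> List[str]:
    """Order model names from highest to lowest mean pLDDT."""
    scores = {name: mean(values) for name, values in per_residue_plddt.items()}
    return sorted(scores, key=scores.get, reverse=True)


# Toy per-residue scores for three models (made-up numbers).
toy_plddt = {
    "model_1": [90.0, 80.0, 70.0],
    "model_2": [95.0, 92.0, 89.0],
    "model_3": [50.0, 60.0, 55.0],
}
order = rank_by_mean_plddt(toy_plddt)

# ranked_0.pdb corresponds to the most confident model, and so on.
ranked_files = {f"ranked_{i}.pdb": name for i, name in enumerate(order)}
```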
+The pLDDT confidence measure is stored in the B-factor field of the output PDB
+files (although unlike a B-factor, higher pLDDT is better, so care must be taken
+when using for tasks such as molecular replacement).
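Because pLDDT is written into the B-factor field, it can be read back with plain fixed-column parsing. The sketch below assumes standard PDB `ATOM` records (atom name in columns 13-16, chain ID in column 22, residue number in columns 23-26, B-factor in columns 61-66) and takes one value per residue from the CA atom; the helper names and the toy records are hypothetical.

```python
from typing import Dict, Tuple


def make_atom_line(serial: int, name: str, resname: str, chain: str,
                   resseq: int, b_factor: float) -> str:
    """Build a toy fixed-column ATOM record with zeroed coordinates."""
    return (f"ATOM  {serial:>5} {name:<4} {resname:>3} {chain}{resseq:>4}    "
            f"{0.0:>8.3f}{0.0:>8.3f}{0.0:>8.3f}{1.0:>6.2f}{b_factor:>6.2f}")


def per_residue_plddt(pdb_text: str) -> Dict[Tuple[str, int], float]:
    """Map (chain ID, residue number) to the B-factor of its CA atom."""
    scores = {}
    for line in pdb_text.splitlines():
        if line.startswith("ATOM") and line[12:16].strip() == "CA":
            key = (line[21], int(line[22:26]))
            scores[key] = float(line[60:66])
    return scores


# Toy structure: two residues with made-up confidence values.
toy_pdb = "\n".join([
    make_atom_line(1, " N  ", "MET", "A", 1, 91.25),
    make_atom_line(2, " CA ", "MET", "A", 1, 91.25),
    make_atom_line(3, " CA ", "GLY", "A", 2, 42.50),
])
scores = per_residue_plddt(toy_pdb)

# Residues below an arbitrary illustrative threshold of 70.
low_confidence = [key for key, value in scores.items() if value < 70.0]
```

Keep in mind the inversion noted above: tools that treat this column as a true B-factor will read high pLDDT as high disorder.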
+This code has been tested to match mean top-1 accuracy on a CASP14 test set with
+pLDDT ranking over 5 model predictions (some CASP targets were run with earlier
+versions of AlphaFold and some had manual interventions; see our forthcoming
+publication for details). Some targets such as T1064 may also have high
+individual run variance over random seeds.
+## Inferencing many proteins
+The provided inference script is optimized for predicting the structure of a
+single protein, and it will compile the neural network to be specialized to
+exactly the size of the sequence, MSA, and templates. For large proteins, the
+compile time is a negligible fraction of the runtime, but it may become more
+significant for small proteins or if the multi-sequence alignments are already
+precomputed. In the bulk inference case, it may make sense to use our
+`make_fixed_size` function to pad the inputs to a uniform size, thereby reducing
+the number of compilations required.
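The idea of padding to a uniform size can be illustrated without touching the model code. The sketch below does not call `make_fixed_size`; it only shows, under the assumption that each distinct input length triggers one compilation, how rounding lengths up to a bucket boundary reduces the number of distinct shapes. The function name and the bucket size of 256 are arbitrary illustrative choices.

```python
import math
from typing import List


def padded_length(num_residues: int, bucket_size: int = 256) -> int:
    """Round a sequence length up to the next multiple of bucket_size."""
    if num_residues < 1 or bucket_size < 1:
        raise ValueError("lengths and bucket size must be positive")
    return math.ceil(num_residues / bucket_size) * bucket_size


# Toy batch of protein lengths (made-up values).
lengths: List[int] = [97, 130, 255, 256, 300, 410, 512]
padded = [padded_length(n) for n in lengths]

# Distinct shapes before and after bucketing.
shapes_before = len(set(lengths))
shapes_after = len(set(padded))
```

Here seven distinct lengths collapse into two padded sizes, at the cost of some wasted compute on padding for the shorter sequences.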
+We do not provide a bulk inference script, but it should be straightforward to
+develop on top of the `RunModel.predict` method with a parallel system for
+precomputing multi-sequence alignments. Alternatively, this script can be run
+repeatedly with only moderate overhead.
+## Note on CASP14 reproducibility
+AlphaFold's output for a small number of proteins has high inter-run variance,
+and may be affected by changes in the input data. The CASP14 target T1064 is a
+notable example; the large number of SARS-CoV-2-related sequences recently
+deposited changes its MSA significantly. This variability is somewhat mitigated
+by the model selection process: running 5 models and taking the most confident.
+To reproduce the results of our CASP14 system as closely as possible you must
+use the same database versions we used in CASP. These may not match the default
+versions downloaded by our scripts.
+For genetics:
+*   UniRef90:
+    [v2020_01](https://ftp.uniprot.org/pub/databases/uniprot/previous_releases/release-2020_01/uniref/)
+*   MGnify:
+    [v2018_12](http://ftp.ebi.ac.uk/pub/databases/metagenomics/peptide_database/2018_12/)
+*   Uniclust30: [v2018_08](http://wwwuser.gwdg.de/~compbiol/uniclust/2018_08/)
+*   BFD: [only version available](https://bfd.mmseqs.com/)
+For templates:
+*   PDB: (downloaded 2020-05-14)
+*   PDB70:
+    [2020-05-13](http://wwwuser.gwdg.de/~compbiol/data/hhsuite/databases/hhsuite_dbs/old-releases/pdb70_from_mmcif_200513.tar.gz)
+An alternative for templates is to use the latest PDB and PDB70, but pass the
+flag `--max_template_date=2020-05-14`, which restricts templates only to
+structures that were available at the start of CASP14.
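The effect of a template date cutoff can be pictured as a simple filter over release dates. This is only a sketch: the template IDs and dates are made up, `filter_templates` is a hypothetical helper, and treating the cutoff as inclusive is an assumption rather than something stated in this README.

```python
import datetime


def filter_templates(release_dates, max_template_date):
    """Keeps template IDs released on or before the cutoff (assumed inclusive)."""
    cutoff = datetime.date.fromisoformat(max_template_date)
    return sorted(
        template_id
        for template_id, released in release_dates.items()
        if datetime.date.fromisoformat(released) <= cutoff
    )


# Hypothetical template IDs and release dates, for illustration only.
release_dates = {
    "tmpl_a": "2019-11-02",
    "tmpl_b": "2020-05-14",
    "tmpl_c": "2020-05-15",
}
allowed = filter_templates(release_dates, max_template_date="2020-05-14")
```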
+## Citing this work
+If you use the code or data in this package, please cite:
+```bibtex
+@Article{AlphaFold2021,
+  author  = {Jumper, John and Evans, Richard and Pritzel, Alexander and Green, Tim and Figurnov, Michael and Ronneberger, Olaf and Tunyasuvunakool, Kathryn and Bates, Russ and {\v{Z}}{\'\i}dek, Augustin and Potapenko, Anna and Bridgland, Alex and Meyer, Clemens and Kohl, Simon A A and Ballard, Andrew J and Cowie, Andrew and Romera-Paredes, Bernardino and Nikolov, Stanislav and Jain, Rishub and Adler, Jonas and Back, Trevor and Petersen, Stig and Reiman, David and Clancy, Ellen and Zielinski, Michal and Steinegger, Martin and Pacholska, Michalina and Berghammer, Tamas and Bodenstein, Sebastian and Silver, David and Vinyals, Oriol and Senior, Andrew W and Kavukcuoglu, Koray and Kohli, Pushmeet and Hassabis, Demis},
+  journal = {Nature},
+  title   = {Highly accurate protein structure prediction with {AlphaFold}},
+  year    = {2021},
+  volume  = {596},
+  number  = {7873},
+  pages   = {583--589},
+  doi     = {10.1038/s41586-021-03819-2}
+}
+```
+In addition, if you use the AlphaFold-Multimer mode, please cite:
+```bibtex
+@article {AlphaFold-Multimer2021,
+  author       = {Evans, Richard and O{\textquoteright}Neill, Michael and Pritzel, Alexander and Antropova, Natasha and Senior, Andrew and Green, Tim and {\v{Z}}{\'\i}dek, Augustin and Bates, Russ and Blackwell, Sam and Yim, Jason and Ronneberger, Olaf and Bodenstein, Sebastian and Zielinski, Michal and Bridgland, Alex and Potapenko, Anna and Cowie, Andrew and Tunyasuvunakool, Kathryn and Jain, Rishub and Clancy, Ellen and Kohli, Pushmeet and Jumper, John and Hassabis, Demis},
+  journal      = {bioRxiv},
+  title        = {Protein complex prediction with AlphaFold-Multimer},
+  year         = {2021},
+  elocation-id = {2021.10.04.463034},
+  doi          = {10.1101/2021.10.04.463034},
+  URL          = {https://www.biorxiv.org/content/early/2021/10/04/2021.10.04.463034},
+  eprint       = {https://www.biorxiv.org/content/early/2021/10/04/2021.10.04.463034.full.pdf},
+}
+```
+## Community contributions
+Colab notebooks provided by the community (please note that these notebooks may
+vary from our full AlphaFold system and we did not validate their accuracy):
+*   The
+    [ColabFold AlphaFold2 notebook](https://colab.research.google.com/github/sokrypton/ColabFold/blob/main/AlphaFold2.ipynb)
+    by Martin Steinegger, Sergey Ovchinnikov and Milot Mirdita, which uses an
+    API hosted at the Södinglab based on the MMseqs2 server
+    [(Mirdita et al. 2019, Bioinformatics)](https://academic.oup.com/bioinformatics/article/35/16/2856/5280135)
+    for the multiple sequence alignment creation.
+## Acknowledgements
+AlphaFold communicates with and/or references the following separate libraries
+and packages:
+*   [Abseil](https://github.com/abseil/abseil-py)
+*   [Biopython](https://biopython.org)
+*   [Chex](https://github.com/deepmind/chex)
+*   [Colab](https://research.google.com/colaboratory/)
+*   [Docker](https://www.docker.com)
+*   [HH Suite](https://github.com/soedinglab/hh-suite)
+*   [HMMER Suite](http://eddylab.org/software/hmmer)
+*   [Haiku](https://github.com/deepmind/dm-haiku)
+*   [Immutabledict](https://github.com/corenting/immutabledict)
+*   [JAX](https://github.com/google/jax/)
+*   [Kalign](https://msa.sbc.su.se/cgi-bin/msa.cgi)
+*   [matplotlib](https://matplotlib.org/)
+*   [ML Collections](https://github.com/google/ml_collections)
+*   [NumPy](https://numpy.org)
+*   [OpenMM](https://github.com/openmm/openmm)
+*   [OpenStructure](https://openstructure.org)
+*   [pandas](https://pandas.pydata.org/)
+*   [py3Dmol](https://github.com/avirshup/py3dmol)
+*   [SciPy](https://scipy.org)
+*   [Sonnet](https://github.com/deepmind/sonnet)
+*   [TensorFlow](https://github.com/tensorflow/tensorflow)
+*   [Tree](https://github.com/deepmind/tree)
+*   [tqdm](https://github.com/tqdm/tqdm)
+We thank all their contributors and maintainers!
+## Get in Touch
+If you have any questions not covered in this overview, please contact the
+AlphaFold team at [alphafold@deepmind.com](mailto:alphafold@deepmind.com).
+We would love to hear your feedback and understand how AlphaFold has been useful
+in your research. Share your stories with us at
+[alphafold@deepmind.com](mailto:alphafold@deepmind.com).
+## License and Disclaimer
+This is not an officially supported Google product.
+Copyright 2022 DeepMind Technologies Limited.
+### AlphaFold Code License
+Licensed under the Apache License, Version 2.0 (the "License"); you may not use
+this file except in compliance with the License. You may obtain a copy of the
+License at https://www.apache.org/licenses/LICENSE-2.0.
+Unless required by applicable law or agreed to in writing, software distributed
+under the License is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR
+CONDITIONS OF ANY KIND, either express or implied. See the License for the
+specific language governing permissions and limitations under the License.
+### Model Parameters License
+The AlphaFold parameters are made available under the terms of the Creative
+Commons Attribution 4.0 International (CC BY 4.0) license. You can find details
+at: https://creativecommons.org/licenses/by/4.0/legalcode
+### Third-party software
+Use of the third-party software, libraries or code referred to in the
+[Acknowledgements](#acknowledgements) section above may be governed by separate
+terms and conditions or license provisions. Your use of the third-party
+software, libraries or code is subject to any such terms and you should check
+that you can comply with any applicable restrictions or terms and conditions
+before use.
+### Mirrored Databases
+The following databases have been mirrored by DeepMind, and are available with
+reference to the following:
+*   [BFD](https://bfd.mmseqs.com/) (unmodified), by Steinegger M. and Söding J.,
+    available under a
+    [Creative Commons Attribution-ShareAlike 4.0 International License](http://creativecommons.org/licenses/by-sa/4.0/).
+*   [BFD](https://bfd.mmseqs.com/) (modified), by Steinegger M. and Söding J.,
+    modified by DeepMind, available under a
+    [Creative Commons Attribution-ShareAlike 4.0 International License](http://creativecommons.org/licenses/by-sa/4.0/).
+    See the Methods section of the
+    [AlphaFold proteome paper](https://www.nature.com/articles/s41586-021-03828-1)
+    for details.
+*   [Uniref30: v2021_03](http://wwwuser.gwdg.de/~compbiol/uniclust/2021_03/)
+    (unmodified), by Mirdita M. et al., available under a
+    [Creative Commons Attribution-ShareAlike 4.0 International License](http://creativecommons.org/licenses/by-sa/4.0/).
+*   [MGnify: v2022_05](http://ftp.ebi.ac.uk/pub/databases/metagenomics/peptide_database/2022_05/README.txt)
+    (unmodified), by Mitchell AL et al., available free of all copyright
+    restrictions and made fully and freely available for both non-commercial and
+    commercial use under
+    [CC0 1.0 Universal (CC0 1.0) Public Domain Dedication](https://creativecommons.org/publicdomain/zero/1.0/).
--- a/alphafold/common/confidence.py
+++ b/alphafold/common/confidence.py
@@ -14,7 +14,9 @@
 """Functions for processing confidence metrics."""
+import json
 from typing import Dict, Optional, Tuple
 import numpy as np
 import scipy.special
@@ -36,6 +38,43 @@ def compute_plddt(logits: np.ndarray) -> np.ndarray:
  return predicted_lddt_ca * 100
+def _confidence_category(score: float) -> str:
+  """Categorizes pLDDT into: disordered (D), low (L), medium (M), high (H)."""
+  if 0 <= score < 50:
+    return 'D'
+  if 50 <= score < 70:
+    return 'L'
+  elif 70 <= score < 90:
+    return 'M'
+  elif 90 <= score <= 100:
+    return 'H'
+  else:
+    raise ValueError(f'Invalid pLDDT score {score}')
+def confidence_json(plddt: np.ndarray) -> str:
+  """Returns JSON with confidence score and category for every residue.
+  Args:
+    plddt: Per-residue confidence metric data.
+  Returns:
+    String with a formatted JSON.
+  Raises:
+    ValueError: If `plddt` has a rank different than 1.
+  """
+  if plddt.ndim != 1:
+    raise ValueError(f'The plddt array must be rank 1, got: {plddt.shape}.')
+  confidence = {
+      'residueNumber': list(range(1, len(plddt) + 1)),
+      'confidenceScore': [round(float(s), 2) for s in plddt],
+      'confidenceCategory': [_confidence_category(s) for s in plddt],
+  }
+  return json.dumps(confidence, indent=None, separators=(',', ':'))
 def _calculate_bin_centers(breaks: np.ndarray):
  """Gets the bin centers from the bin edges.
@@ -108,6 +147,32 @@ def compute_predicted_aligned_error(
  }
+def pae_json(pae: np.ndarray, max_pae: float) -> str:
+  """Returns the PAE in the same format as is used in the AFDB.
+  Note that the values are presented as floats to 1 decimal place, whereas AFDB
+  returns integer values.
+  Args:
+    pae: The n_res x n_res PAE array.
+    max_pae: The maximum possible PAE value.
+  Returns:
+    PAE output format as a JSON string.
+  """
+  # Check the PAE array is the correct shape.
+  if pae.ndim != 2 or pae.shape[0] != pae.shape[1]:
+    raise ValueError(f'PAE must be a square matrix, got {pae.shape}')
+  # Round the predicted aligned errors to 1 decimal place.
+  rounded_errors = np.round(pae.astype(np.float64), decimals=1)
+  formatted_output = [{
+      'predicted_aligned_error': rounded_errors.tolist(),
+      'max_predicted_aligned_error': max_pae,
+  }]
+  return json.dumps(formatted_output, indent=None, separators=(',', ':'))
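On the consuming side, the JSON produced above can be read back with the standard library alone. The sketch below assumes the single-element list layout shown in `pae_json`; the helper `mean_inter_block_pae` and the choice of residue blocks are hypothetical, intended only to show one way of summarising packing confidence between two regions.

```python
import json


def mean_inter_block_pae(pae_json_str, block_a, block_b):
    """Averages PAE over all residue pairs that span two blocks, both directions."""
    (entry,) = json.loads(pae_json_str)  # The payload is a one-element list.
    pae = entry["predicted_aligned_error"]
    values = [pae[i][j] for i in block_a for j in block_b]
    # PAE is not symmetric, so the reverse direction is included as well.
    values += [pae[j][i] for i in block_a for j in block_b]
    return sum(values) / len(values)


example = (
    '[{"predicted_aligned_error":[[0.0,13.1],[20.1,0.0]],'
    '"max_predicted_aligned_error":31.75}]'
)
score = mean_inter_block_pae(example, block_a=[0], block_b=[1])
max_pae = json.loads(example)[0]["max_predicted_aligned_error"]
```

A low inter-block average suggests the relative placement of the two regions is trusted; a value near `max_predicted_aligned_error` suggests it is not.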
 def predicted_tm_score(
    logits: np.ndarray,
    breaks: np.ndarray,

--- a/alphafold/common/confidence_test.py
+++ b/alphafold/common/confidence_test.py
+# Copyright 2023 DeepMind Technologies Limited
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+#      http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+"""Test confidence metrics."""
+from absl.testing import absltest
+from alphafold.common import confidence
+import numpy as np
+class ConfidenceTest(absltest.TestCase):
+  def test_pae_json(self):
+    pae = np.array([[0.01, 13.12345], [20.0987, 0.0]])
+    pae_json = confidence.pae_json(pae=pae, max_pae=31.75)
+    self.assertEqual(
+        pae_json, '[{"predicted_aligned_error":[[0.0,13.1],[20.1,0.0]],'
+        '"max_predicted_aligned_error":31.75}]')
+  def test_confidence_json(self):
+    plddt = np.array([42, 42.42])
+    confidence_json = confidence.confidence_json(plddt=plddt)
+    print(confidence_json)
+    self.assertEqual(
+        confidence_json,
+        ('{"residueNumber":[1,2],'
+         '"confidenceScore":[42.0,42.42],'
+         '"confidenceCategory":["D","D"]}'),
+    )
+if __name__ == '__main__':
+  absltest.main()
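The category boundaries exercised by `test_confidence_json` can also be expressed as a lookup over sorted thresholds. The following is a standalone sketch using only the standard library, not the code in `confidence.py`; the names `confidence_category` and `confidence_json` are reused here purely for readability, and the input is a plain list rather than a NumPy array.

```python
import bisect
import json

_THRESHOLDS = (50, 70, 90)  # Lower bounds of the L, M and H categories.
_CATEGORIES = "DLMH"        # Disordered, low, medium, high.


def confidence_category(score):
    """Maps a pLDDT score in [0, 100] to D, L, M or H."""
    if not 0 <= score <= 100:
        raise ValueError(f"Invalid pLDDT score {score}")
    # bisect_right makes each threshold belong to the higher category.
    return _CATEGORIES[bisect.bisect_right(_THRESHOLDS, score)]


def confidence_json(plddt):
    """Builds the compact per-residue JSON from a flat list of scores."""
    confidence = {
        "residueNumber": list(range(1, len(plddt) + 1)),
        "confidenceScore": [round(float(s), 2) for s in plddt],
        "confidenceCategory": [confidence_category(s) for s in plddt],
    }
    return json.dumps(confidence, separators=(",", ":"))


result = confidence_json([42, 42.42])
boundary_categories = [confidence_category(s) for s in (49.99, 50, 70, 90, 100)]
```

The boundary cases are the ones worth testing: a score of exactly 50, 70 or 90 falls into the higher category, and 100 is still valid.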
--- a/alphafold/common/mmcif_metadata.py
+++ b/alphafold/common/mmcif_metadata.py
+# Copyright 2021 DeepMind Technologies Limited
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+#      http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+"""mmCIF metadata."""
+from typing import Mapping, Sequence
+from alphafold import version
+import numpy as np
+_DISCLAIMER = """ALPHAFOLD DATA, COPYRIGHT (2021) DEEPMIND TECHNOLOGIES LIMITED.
+THE INFORMATION PROVIDED IS THEORETICAL MODELLING ONLY AND CAUTION SHOULD BE
+EXERCISED IN ITS USE. IT IS PROVIDED "AS-IS" WITHOUT ANY WARRANTY OF ANY KIND,
+WHETHER EXPRESSED OR IMPLIED. NO WARRANTY IS GIVEN THAT USE OF THE INFORMATION
+SHALL NOT INFRINGE THE RIGHTS OF ANY THIRD PARTY. DISCLAIMER: THE INFORMATION IS
+NOT INTENDED TO BE A SUBSTITUTE FOR PROFESSIONAL MEDICAL ADVICE, DIAGNOSIS, OR
+TREATMENT, AND DOES NOT CONSTITUTE MEDICAL OR OTHER PROFESSIONAL ADVICE. IT IS
+AVAILABLE FOR ACADEMIC AND COMMERCIAL PURPOSES, UNDER CC-BY 4.0 LICENCE."""
+# Authors of the Nature methods paper we reference in the mmCIF.
+_MMCIF_PAPER_AUTHORS = (
+    'Jumper, John',
+    'Evans, Richard',
+    'Pritzel, Alexander',
+    'Green, Tim',
+    'Figurnov, Michael',
+    'Ronneberger, Olaf',
+    'Tunyasuvunakool, Kathryn',
+    'Bates, Russ',
+    'Zidek, Augustin',
+    'Potapenko, Anna',
+    'Bridgland, Alex',
+    'Meyer, Clemens',
+    'Kohl, Simon A. A.',
+    'Ballard, Andrew J.',
+    'Cowie, Andrew',
+    'Romera-Paredes, Bernardino',
+    'Nikolov, Stanislav',
+    'Jain, Rishub',
+    'Adler, Jonas',
+    'Back, Trevor',
+    'Petersen, Stig',
+    'Reiman, David',
+    'Clancy, Ellen',
+    'Zielinski, Michal',
+    'Steinegger, Martin',
+    'Pacholska, Michalina',
+    'Berghammer, Tamas',
+    'Silver, David',
+    'Vinyals, Oriol',
+    'Senior, Andrew W.',
+    'Kavukcuoglu, Koray',
+    'Kohli, Pushmeet',
+    'Hassabis, Demis',
+)
+# Authors of the mmCIF - we set them to be equal to the authors of the paper.
+_MMCIF_AUTHORS = _MMCIF_PAPER_AUTHORS
+def add_metadata_to_mmcif(
+    old_cif: Mapping[str, Sequence[str]], model_type: str
+) -> Mapping[str, Sequence[str]]:
+  """Adds AlphaFold metadata in the given mmCIF."""
+  cif = {}
+  # ModelCIF conformance dictionary.
+  cif['_audit_conform.dict_name'] = ['mmcif_ma.dic']
+  cif['_audit_conform.dict_version'] = ['1.3.9']
+  cif['_audit_conform.dict_location'] = [
+      'https://raw.githubusercontent.com/ihmwg/ModelCIF/master/dist/'
+      'mmcif_ma.dic'
+  ]
+  # License and disclaimer.
+  cif['_pdbx_data_usage.id'] = ['1', '2']
+  cif['_pdbx_data_usage.type'] = ['license', 'disclaimer']
+  cif['_pdbx_data_usage.details'] = [
+      'Data in this file is available under a CC-BY-4.0 license.',
+      _DISCLAIMER,
+  ]
+  cif['_pdbx_data_usage.url'] = [
+      'https://creativecommons.org/licenses/by/4.0/',
+      '?',
+  ]
+  cif['_pdbx_data_usage.name'] = ['CC-BY-4.0', '?']
+  # Structure author details.
+  cif['_audit_author.name'] = []
+  cif['_audit_author.pdbx_ordinal'] = []
+  for author_index, author_name in enumerate(_MMCIF_AUTHORS, start=1):
+    cif['_audit_author.name'].append(author_name)
+    cif['_audit_author.pdbx_ordinal'].append(str(author_index))
+  # Paper author details.
+  cif['_citation_author.citation_id'] = []
+  cif['_citation_author.name'] = []
+  cif['_citation_author.ordinal'] = []
+  for author_index, author_name in enumerate(_MMCIF_PAPER_AUTHORS, start=1):
+    cif['_citation_author.citation_id'].append('primary')
+    cif['_citation_author.name'].append(author_name)
+    cif['_citation_author.ordinal'].append(str(author_index))
+  # Paper citation details.
+  cif['_citation.id'] = ['primary']
+  cif['_citation.title'] = [
+      'Highly accurate protein structure prediction with AlphaFold'
+  ]
+  cif['_citation.journal_full'] = ['Nature']
+  cif['_citation.journal_volume'] = ['596']
+  cif['_citation.page_first'] = ['583']
+  cif['_citation.page_last'] = ['589']
+  cif['_citation.year'] = ['2021']
+  cif['_citation.journal_id_ASTM'] = ['NATUAS']
+  cif['_citation.country'] = ['UK']
+  cif['_citation.journal_id_ISSN'] = ['0028-0836']
+  cif['_citation.journal_id_CSD'] = ['0006']
+  cif['_citation.book_publisher'] = ['?']
+  cif['_citation.pdbx_database_id_PubMed'] = ['34265844']
+  cif['_citation.pdbx_database_id_DOI'] = ['10.1038/s41586-021-03819-2']
+  # Type of data in the dataset including data used in the model generation.
+  cif['_ma_data.id'] = ['1']
+  cif['_ma_data.name'] = ['Model']
+  cif['_ma_data.content_type'] = ['model coordinates']
+  # Description of number of instances for each entity.
+  cif['_ma_target_entity_instance.asym_id'] = old_cif['_struct_asym.id']
+  cif['_ma_target_entity_instance.entity_id'] = old_cif[
+      '_struct_asym.entity_id'
+  ]
+  cif['_ma_target_entity_instance.details'] = ['.'] * len(
+      cif['_ma_target_entity_instance.entity_id']
+  )
+  # Details about the target entities.
+  cif['_ma_target_entity.entity_id'] = cif[
+      '_ma_target_entity_instance.entity_id'
+  ]
+  cif['_ma_target_entity.data_id'] = ['1'] * len(
+      cif['_ma_target_entity.entity_id']
+  )
+  cif['_ma_target_entity.origin'] = ['.'] * len(
+      cif['_ma_target_entity.entity_id']
+  )
+  # Details of the models being deposited.
+  cif['_ma_model_list.ordinal_id'] = ['1']
+  cif['_ma_model_list.model_id'] = ['1']
+  cif['_ma_model_list.model_group_id'] = ['1']
+  cif['_ma_model_list.model_name'] = ['Top ranked model']
+  cif['_ma_model_list.model_group_name'] = [
+      f'AlphaFold {model_type} v{version.__version__} model'
+  ]
+  cif['_ma_model_list.data_id'] = ['1']
+  cif['_ma_model_list.model_type'] = ['Ab initio model']
+  # Software used.
+  cif['_software.pdbx_ordinal'] = ['1']
+  cif['_software.name'] = ['AlphaFold']
+  cif['_software.version'] = [f'v{version.__version__}']
+  cif['_software.type'] = ['package']
+  cif['_software.description'] = ['Structure prediction']
+  cif['_software.classification'] = ['other']
+  cif['_software.date'] = ['?']
+  # Collection of software into groups.
+  cif['_ma_software_group.ordinal_id'] = ['1']
+  cif['_ma_software_group.group_id'] = ['1']
+  cif['_ma_software_group.software_id'] = ['1']
+  # Method description to conform with ModelCIF.
+  cif['_ma_protocol_step.ordinal_id'] = ['1', '2', '3']
+  cif['_ma_protocol_step.protocol_id'] = ['1', '1', '1']
+  cif['_ma_protocol_step.step_id'] = ['1', '2', '3']
+  cif['_ma_protocol_step.method_type'] = [
+      'coevolution MSA',
+      'template search',
+      'modeling',
+  ]
+  # Details of the metrics used to assess model confidence.
+  cif['_ma_qa_metric.id'] = ['1', '2']
+  cif['_ma_qa_metric.name'] = ['pLDDT', 'pLDDT']
+  # Accepted values are distance, energy, normalised score, other, zscore.
+  cif['_ma_qa_metric.type'] = ['pLDDT', 'pLDDT']
+  cif['_ma_qa_metric.mode'] = ['global', 'local']
+  cif['_ma_qa_metric.software_group_id'] = ['1', '1']
+  # Global model confidence metric value.
+  cif['_ma_qa_metric_global.ordinal_id'] = ['1']
+  cif['_ma_qa_metric_global.model_id'] = ['1']
+  cif['_ma_qa_metric_global.metric_id'] = ['1']
+  global_plddt = np.mean(
+      [float(v) for v in old_cif['_atom_site.B_iso_or_equiv']]
+  )
+  cif['_ma_qa_metric_global.metric_value'] = [f'{global_plddt:.2f}']
+  cif['_atom_type.symbol'] = sorted(set(old_cif['_atom_site.type_symbol']))
+  return cif
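One consequence of computing the global pLDDT from `_atom_site.B_iso_or_equiv` is that the mean is taken over atoms, so residues with more atoms carry more weight. The sketch below illustrates this with made-up numbers (a 4-atom residue at 90 and a 14-atom residue at 50); the variable names are hypothetical and the values are not real predictions.

```python
from statistics import fmean

# Hypothetical per-atom B-factor column: 4 atoms at pLDDT 90, 14 atoms at 50.
b_iso = ["90.00"] * 4 + ["50.00"] * 14

# Mean over atoms, mirroring how the metadata code averages the column.
atom_weighted = fmean([float(v) for v in b_iso])
# Mean over residues, which is what a per-residue pLDDT average would give.
per_residue = fmean([90.0, 50.0])

formatted = f"{atom_weighted:.2f}"
```

In this toy case the atom-weighted value is noticeably lower than the per-residue mean of 70, so the two should not be compared directly.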
--- a/alphafold/common/protein.py
+++ b/alphafold/common/protein.py
@@ -13,11 +13,18 @@
 # limitations under the License.
 """Protein data type."""
+import collections
 import dataclasses
+import functools
 import io
-from typing import Any, Mapping, Optional
+from typing import Any, Dict, List, Mapping, Optional, Tuple
+from alphafold.common import mmcif_metadata
 from alphafold.common import residue_constants
+from Bio.PDB import MMCIFParser
 from Bio.PDB import PDBParser
+from Bio.PDB.mmcifio import MMCIFIO
+from Bio.PDB.Structure import Structure
 import numpy as np
 FeatureDict = Mapping[str, np.ndarray]
@@ -27,6 +34,32 @@ ModelOutput = Mapping[str, Any]  # Is a nested dict.
 PDB_CHAIN_IDS = 'ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789'
 PDB_MAX_CHAINS = len(PDB_CHAIN_IDS)  # := 62.
+# Data to fill the _chem_comp table when writing mmCIFs.
+_CHEM_COMP: Mapping[str, Tuple[Tuple[str, str], ...]] = {
+    'L-peptide linking': (
+        ('ALA', 'ALANINE'),
+        ('ARG', 'ARGININE'),
+        ('ASN', 'ASPARAGINE'),
+        ('ASP', 'ASPARTIC ACID'),
+        ('CYS', 'CYSTEINE'),
+        ('GLN', 'GLUTAMINE'),
+        ('GLU', 'GLUTAMIC ACID'),
+        ('HIS', 'HISTIDINE'),
+        ('ILE', 'ISOLEUCINE'),
+        ('LEU', 'LEUCINE'),
+        ('LYS', 'LYSINE'),
+        ('MET', 'METHIONINE'),
+        ('PHE', 'PHENYLALANINE'),
+        ('PRO', 'PROLINE'),
+        ('SER', 'SERINE'),
+        ('THR', 'THREONINE'),
+        ('TRP', 'TRYPTOPHAN'),
+        ('TYR', 'TYROSINE'),
+        ('VAL', 'VALINE'),
+    ),
+    'peptide linking': (('GLY', 'GLYCINE'),),
+}
 @dataclasses.dataclass(frozen=True)
 class Protein:
@@ -63,27 +96,32 @@ class Protein:
          'because these cannot be written to PDB format.')
-def from_pdb_string(pdb_str: str, chain_id: Optional[str] = None) -> Protein:
+def _from_bio_structure(
-  """Takes a PDB string and constructs a Protein object.
+    structure: Structure, chain_id: Optional[str] = None
+) -> Protein:
+  """Takes a Biopython structure and creates a `Protein` instance.
  WARNING: All non-standard residue types will be converted into UNK. All
    non-standard atoms will be ignored.
  Args:
-    pdb_str: The contents of the pdb file
+    structure: Structure from the Biopython library.
-    chain_id: If chain_id is specified (e.g. A), then only that chain
+    chain_id: If chain_id is specified (e.g. A), then only that chain is parsed.
-      is parsed. Otherwise all chains are parsed.
+      Otherwise all chains are parsed.
  Returns:
-    A new `Protein` parsed from the pdb contents.
+    A new `Protein` created from the structure contents.
+  Raises:
+    ValueError: If the number of models included in the structure is not 1.
+    ValueError: If insertion code is detected at a residue.
  """
-  pdb_fh = io.StringIO(pdb_str)
-  parser = PDBParser(QUIET=True)
-  structure = parser.get_structure('none', pdb_fh)
  models = list(structure.get_models())
  if len(models) != 1:
    raise ValueError(
-        f'Only single model PDBs are supported. Found {len(models)} models.')
+        'Only single model PDBs/mmCIFs are supported. Found'
+        f' {len(models)} models.'
+    )
  model = models[0]
  atom_positions = []
@@ -99,8 +137,9 @@ def from_pdb_string(pdb_str: str, chain_id: Optional[str] = None) -> Protein:
    for res in chain:
      if res.id[2] != ' ':
        raise ValueError(
-            f'PDB contains an insertion code at chain {chain.id} and residue '
+            f'PDB/mmCIF contains an insertion code at chain {chain.id} and'
-            f'index {res.id[1]}. These are not supported.')
+            f' residue index {res.id[1]}. These are not supported.'
+        )
      res_shortname = residue_constants.restype_3to1.get(res.resname, 'X')
      restype_idx = residue_constants.restype_order.get(
          res_shortname, residue_constants.restype_num)
@@ -137,6 +176,48 @@ def from_pdb_string(pdb_str: str, chain_id: Optional[str] = None) -> Protein:
      b_factors=np.array(b_factors))
+def from_pdb_string(pdb_str: str, chain_id: Optional[str] = None) -> Protein:
+  """Takes a PDB string and constructs a `Protein` object.
+  WARNING: All non-standard residue types will be converted into UNK. All
+    non-standard atoms will be ignored.
+  Args:
+    pdb_str: The contents of the pdb file
+    chain_id: If chain_id is specified (e.g. A), then only that chain is parsed.
+      Otherwise all chains are parsed.
+  Returns:
+    A new `Protein` parsed from the pdb contents.
+  """
+  with io.StringIO(pdb_str) as pdb_fh:
+    parser = PDBParser(QUIET=True)
+    structure = parser.get_structure(id='none', file=pdb_fh)
+    return _from_bio_structure(structure, chain_id)
+def from_mmcif_string(
+    mmcif_str: str, chain_id: Optional[str] = None
+) -> Protein:
+  """Takes a mmCIF string and constructs a `Protein` object.
+  WARNING: All non-standard residue types will be converted into UNK. All
+    non-standard atoms will be ignored.
+  Args:
+    mmcif_str: The contents of the mmCIF file
+    chain_id: If chain_id is specified (e.g. A), then only that chain is parsed.
+      Otherwise all chains are parsed.
+  Returns:
+    A new `Protein` parsed from the mmCIF contents.
+  """
+  with io.StringIO(mmcif_str) as mmcif_fh:
+    parser = MMCIFParser(QUIET=True)
+    structure = parser.get_structure(structure_id='none', filename=mmcif_fh)
+    return _from_bio_structure(structure, chain_id)
 def _chain_end(atom_index, end_resname, chain_name, residue_index) -> str:
  chain_end = 'TER'
  return (f'{chain_end:<6}{atom_index:>5}      {end_resname:>3} '
@@ -276,3 +357,223 @@ def from_prediction(
      residue_index=_maybe_remove_leading_dim(features['residue_index']) + 1,
      chain_index=chain_index,
      b_factors=b_factors)
+def to_mmcif(
+    prot: Protein,
+    file_id: str,
+    model_type: str,
+) -> str:
+  """Converts a `Protein` instance to an mmCIF string.
+  WARNING 1: The _entity_poly_seq is filled with unknown (UNK) residues for any
+    missing residue indices in the range from min(1, min(residue_index)) to
+    max(residue_index). E.g. for a protein object with positions for residues
+    2 (MET), 3 (LYS), 6 (GLY), this method would set the _entity_poly_seq to:
+    1 UNK
+    2 MET
+    3 LYS
+    4 UNK
+    5 UNK
+    6 GLY
+    This is done to preserve the residue numbering.
+  WARNING 2: Converting ground truth mmCIF file to Protein and then back to
+    mmCIF using this method will convert all non-standard residue types to UNK.
+    To preserve non-standard residue types, store more mmCIF metadata in the
+    Protein object (e.g. all fields except for the _atom_site loop).
+  WARNING 3: Converting ground truth mmCIF file to Protein and then back to
+    mmCIF using this method will not retain the original chain indices.
+  WARNING 4: In case of multiple identical chains, they are assigned different
+    `_atom_site.label_entity_id` values.
+  Args:
+    prot: A protein to convert to mmCIF string.
+    file_id: The file ID (usually the PDB ID) to be used in the mmCIF.
+    model_type: 'Multimer' or 'Monomer'.
+  Returns:
+    A valid mmCIF string.
+  Raises:
+    ValueError: If the amino acid types array contains entries with too many
+      protein types.
+  """
+  atom_mask = prot.atom_mask
+  aatype = prot.aatype
+  atom_positions = prot.atom_positions
+  residue_index = prot.residue_index.astype(np.int32)
+  chain_index = prot.chain_index.astype(np.int32)
+  b_factors = prot.b_factors
+  # Construct a mapping from chain integer indices to chain ID strings.
+  chain_ids = {}
+  # We count unknown residues as protein residues.
+  for entity_id in np.unique(chain_index):  # np.unique gives sorted output.
+    chain_ids[entity_id] = _int_id_to_str_id(entity_id + 1)
+  mmcif_dict = collections.defaultdict(list)
+  mmcif_dict['data_'] = file_id.upper()
+  mmcif_dict['_entry.id'] = file_id.upper()
+  label_asym_id_to_entity_id = {}
+  # Entity and chain information.
+  for entity_id, chain_id in chain_ids.items():
+    # Add all chain information to the _struct_asym table.
+    label_asym_id_to_entity_id[str(chain_id)] = str(entity_id)
+    mmcif_dict['_struct_asym.id'].append(chain_id)
+    mmcif_dict['_struct_asym.entity_id'].append(str(entity_id))
+    # Add information about the entity to the _entity_poly table.
+    mmcif_dict['_entity_poly.entity_id'].append(str(entity_id))
+    mmcif_dict['_entity_poly.type'].append(residue_constants.PROTEIN_CHAIN)
+    mmcif_dict['_entity_poly.pdbx_strand_id'].append(chain_id)
+    # Generate the _entity table.
+    mmcif_dict['_entity.id'].append(str(entity_id))
+    mmcif_dict['_entity.type'].append(residue_constants.POLYMER_CHAIN)
+  # Add the residues to the _entity_poly_seq table.
+  for entity_id, (res_ids, aas) in _get_entity_poly_seq(
+      aatype, residue_index, chain_index
+  ).items():
+    for res_id, aa in zip(res_ids, aas):
+      mmcif_dict['_entity_poly_seq.entity_id'].append(str(entity_id))
+      mmcif_dict['_entity_poly_seq.num'].append(str(res_id))
+      mmcif_dict['_entity_poly_seq.mon_id'].append(
+          residue_constants.resnames[aa]
+      )
+  # Populate the chem comp table.
+  for chem_type, chem_comp in _CHEM_COMP.items():
+    for chem_id, chem_name in chem_comp:
+      mmcif_dict['_chem_comp.id'].append(chem_id)
+      mmcif_dict['_chem_comp.type'].append(chem_type)
+      mmcif_dict['_chem_comp.name'].append(chem_name)
+  # Add all atom sites.
+  atom_index = 1
+  for i in range(aatype.shape[0]):
+    res_name_3 = residue_constants.resnames[aatype[i]]
+    if aatype[i] <= len(residue_constants.restypes):
+      atom_names = residue_constants.atom_types
+    else:
+      raise ValueError(
+          'Amino acid types array contains entries with too many protein types.'
+      )
+    for atom_name, pos, mask, b_factor in zip(
+        atom_names, atom_positions[i], atom_mask[i], b_factors[i]
+    ):
+      if mask < 0.5:
+        continue
+      type_symbol = residue_constants.atom_id_to_type(atom_name)
+      mmcif_dict['_atom_site.group_PDB'].append('ATOM')
+      mmcif_dict['_atom_site.id'].append(str(atom_index))
+      mmcif_dict['_atom_site.type_symbol'].append(type_symbol)
+      mmcif_dict['_atom_site.label_atom_id'].append(atom_name)
+      mmcif_dict['_atom_site.label_alt_id'].append('.')
+      mmcif_dict['_atom_site.label_comp_id'].append(res_name_3)
+      mmcif_dict['_atom_site.label_asym_id'].append(chain_ids[chain_index[i]])
+      mmcif_dict['_atom_site.label_entity_id'].append(
+          label_asym_id_to_entity_id[chain_ids[chain_index[i]]]
+      )
+      mmcif_dict['_atom_site.label_seq_id'].append(str(residue_index[i]))
+      mmcif_dict['_atom_site.pdbx_PDB_ins_code'].append('.')
+      mmcif_dict['_atom_site.Cartn_x'].append(f'{pos[0]:.3f}')
+      mmcif_dict['_atom_site.Cartn_y'].append(f'{pos[1]:.3f}')
+      mmcif_dict['_atom_site.Cartn_z'].append(f'{pos[2]:.3f}')
+      mmcif_dict['_atom_site.occupancy'].append('1.00')
+      mmcif_dict['_atom_site.B_iso_or_equiv'].append(f'{b_factor:.2f}')
+      mmcif_dict['_atom_site.auth_seq_id'].append(str(residue_index[i]))
+      mmcif_dict['_atom_site.auth_asym_id'].append(chain_ids[chain_index[i]])
+      mmcif_dict['_atom_site.pdbx_PDB_model_num'].append('1')
+      atom_index += 1
+  metadata_dict = mmcif_metadata.add_metadata_to_mmcif(mmcif_dict, model_type)
+  mmcif_dict.update(metadata_dict)
+  return _create_mmcif_string(mmcif_dict)
+@functools.lru_cache(maxsize=256)
+def _int_id_to_str_id(num: int) -> str:
+  """Encodes a number as a string, using reverse spreadsheet style naming.
+  Args:
+    num: A positive integer.
+  Returns:
+    A string that encodes the positive integer using reverse spreadsheet style
+    naming, e.g. 1 = A, 2 = B, ..., 27 = AA, 28 = BA, 29 = CA, ... This is the
+    usual way to encode chain IDs in mmCIF files.
+  """
+  if num <= 0:
+    raise ValueError(f'Only positive integers allowed, got {num}.')
+  num = num - 1  # Convert from 1-based to 0-based indexing.
+  output = []
+  while num >= 0:
+    output.append(chr(num % 26 + ord('A')))
+    num = num // 26 - 1
+  return ''.join(output)
+def _get_entity_poly_seq(
+    aatypes: np.ndarray, residue_indices: np.ndarray, chain_indices: np.ndarray
+) -> Dict[int, Tuple[List[int], List[int]]]:
+  """Constructs gapless residue index and aatype lists for each chain.
+  Args:
+    aatypes: A numpy array with aatypes.
+    residue_indices: A numpy array with residue indices.
+    chain_indices: A numpy array with chain indices.
+  Returns:
+    A dictionary mapping chain indices to a tuple with a list of residue
+    indices and a list of aatypes. Missing residues are filled with the UNK
+    residue type.
+  """
+  if (
+      aatypes.shape[0] != residue_indices.shape[0]
+      or aatypes.shape[0] != chain_indices.shape[0]
+  ):
+    raise ValueError(
+        'aatypes, residue_indices, chain_indices must have the same length.'
+    )
+  # Group the present residues by chain index.
+  present = collections.defaultdict(list)
+  for chain_id, res_id, aa in zip(chain_indices, residue_indices, aatypes):
+    present[chain_id].append((res_id, aa))
+  # Add any missing residues (from 1 to the first residue and for any gaps).
+  entity_poly_seq = {}
+  for chain_id, present_residues in present.items():
+    present_residue_indices = set([x[0] for x in present_residues])
+    min_res_id = min(present_residue_indices)  # Could be negative.
+    max_res_id = max(present_residue_indices)
+    new_residue_indices = []
+    new_aatypes = []
+    present_index = 0
+    for i in range(min(1, min_res_id), max_res_id + 1):
+      new_residue_indices.append(i)
+      if i in present_residue_indices:
+        new_aatypes.append(present_residues[present_index][1])
+        present_index += 1
+      else:
+        new_aatypes.append(20)  # Unknown amino acid type.
+    entity_poly_seq[chain_id] = (new_residue_indices, new_aatypes)
+  return entity_poly_seq
+def _create_mmcif_string(mmcif_dict: Dict[str, Any]) -> str:
+  """Converts mmCIF dictionary into mmCIF string."""
+  mmcifio = MMCIFIO()
+  mmcifio.set_dict(mmcif_dict)
+  with io.StringIO() as file_handle:
+    mmcifio.save(file_handle)
+    return file_handle.getvalue()
--- a/alphafold/common/protein_test.py
+++ b/alphafold/common/protein_test.py
@@ -82,16 +82,55 @@ class ProteinTest(parameterized.TestCase):
    np.testing.assert_array_almost_equal(
        prot_reconstr.b_factors, prot.b_factors)
+  @parameterized.named_parameters(
+      dict(
+          testcase_name='glucagon',
+          pdb_file='glucagon.pdb',
+          model_type='Monomer',
+      ),
+      dict(testcase_name='7bui', pdb_file='5nmu.pdb', model_type='Multimer'),
+  )
+  def test_to_mmcif(self, pdb_file, model_type):
+    with open(
+        os.path.join(
+            absltest.get_default_test_srcdir(), TEST_DATA_DIR, pdb_file
+        )
+    ) as f:
+      pdb_string = f.read()
+    prot = protein.from_pdb_string(pdb_string)
+    file_id = 'test'
+    mmcif_string = protein.to_mmcif(prot, file_id, model_type)
+    prot_reconstr = protein.from_mmcif_string(mmcif_string)
+    np.testing.assert_array_equal(prot_reconstr.aatype, prot.aatype)
+    np.testing.assert_array_almost_equal(
+        prot_reconstr.atom_positions, prot.atom_positions
+    )
+    np.testing.assert_array_almost_equal(
+        prot_reconstr.atom_mask, prot.atom_mask
+    )
+    np.testing.assert_array_equal(
+        prot_reconstr.residue_index, prot.residue_index
+    )
+    np.testing.assert_array_equal(prot_reconstr.chain_index, prot.chain_index)
+    np.testing.assert_array_almost_equal(
+        prot_reconstr.b_factors, prot.b_factors
+    )
  def test_ideal_atom_mask(self):
    with open(
-        os.path.join(absltest.get_default_test_srcdir(), TEST_DATA_DIR,
+        os.path.join(
-                     '2rbg.pdb')) as f:
+            absltest.get_default_test_srcdir(), TEST_DATA_DIR, '2rbg.pdb'
+        )
+    ) as f:
      pdb_string = f.read()
    prot = protein.from_pdb_string(pdb_string)
    ideal_mask = protein.ideal_atom_mask(prot)
    non_ideal_residues = set([102] + list(range(127, 286)))
    for i, (res, atom_mask) in enumerate(
-        zip(prot.residue_index, prot.atom_mask)):
+        zip(prot.residue_index, prot.atom_mask)
+    ):
      if res in non_ideal_residues:
        self.assertFalse(np.all(atom_mask == ideal_mask[i]), msg=f'{res}')
      else:

--- a/alphafold/common/residue_constants.py
+++ b/alphafold/common/residue_constants.py
@@ -17,7 +17,7 @@
 import collections
 import functools
 import os
-from typing import List, Mapping, Tuple
+from typing import Final, List, Mapping, Tuple
 import numpy as np
 import tree
@@ -609,6 +609,35 @@ restype_1to3 = {
    'V': 'VAL',
 }
+PROTEIN_CHAIN: Final[str] = 'polypeptide(L)'
+POLYMER_CHAIN: Final[str] = 'polymer'
+def atom_id_to_type(atom_id: str) -> str:
+  """Convert atom ID to atom type, works only for standard protein residues.
+  Args:
+    atom_id: Atom ID to be converted.
+  Returns:
+    String corresponding to atom type.
+  Raises:
+    ValueError: If the atom ID is not recognized.
+  """
+  if atom_id.startswith('C'):
+    return 'C'
+  elif atom_id.startswith('N'):
+    return 'N'
+  elif atom_id.startswith('O'):
+    return 'O'
+  elif atom_id.startswith('H'):
+    return 'H'
+  elif atom_id.startswith('S'):
+    return 'S'
+  raise ValueError('Atom ID not recognized.')
 # NB: restype_3to1 differs from Bio.PDB.protein_letters_3to1 by being a simple
 # 1-to-1 mapping of 3 letter names to one letter names. The latter contains

--- a/alphafold/common/testdata/5nmu.pdb
+++ b/alphafold/common/testdata/5nmu.pdb
--- a/alphafold/common/testdata/glucagon.pdb
+++ b/alphafold/common/testdata/glucagon.pdb
+HEADER    HORMONE                                 17-OCT-77   1GCN              
+TITLE     X-RAY ANALYSIS OF GLUCAGON AND ITS RELATIONSHIP TO RECEPTOR           
+TITLE    2 BINDING                                                              
+COMPND    MOL_ID: 1;                                                            
+COMPND   2 MOLECULE: GLUCAGON;                                                  
+COMPND   3 CHAIN: A;                                                            
+COMPND   4 ENGINEERED: YES                                                      
+SOURCE    MOL_ID: 1;                                                            
+SOURCE   2 ORGANISM_SCIENTIFIC: SUS SCROFA;                                     
+SOURCE   3 ORGANISM_COMMON: PIG;                                                
+SOURCE   4 ORGANISM_TAXID: 9823                                                 
+KEYWDS    HORMONE                                                               
+EXPDTA    X-RAY DIFFRACTION                                                     
+AUTHOR    T.L.BLUNDELL,K.SASAKI,S.DOCKERILL,I.J.TICKLE                          
+REVDAT   6   24-FEB-09 1GCN    1       VERSN                                    
+REVDAT   5   30-SEP-83 1GCN    1       REVDAT                                   
+REVDAT   4   31-DEC-80 1GCN    1       REMARK                                   
+REVDAT   3   22-OCT-79 1GCN    3       ATOM                                     
+REVDAT   2   29-AUG-79 1GCN    3       CRYST1                                   
+REVDAT   1   28-NOV-77 1GCN    0                                                
+JRNL        AUTH   K.SASAKI,S.DOCKERILL,D.A.ADAMIAK,I.J.TICKLE,                 
+JRNL        AUTH 2 T.BLUNDELL                                                   
+JRNL        TITL   X-RAY ANALYSIS OF GLUCAGON AND ITS RELATIONSHIP TO           
+JRNL        TITL 2 RECEPTOR BINDING.                                            
+JRNL        REF    NATURE                        V. 257   751 1975              
+JRNL        REFN                   ISSN 0028-0836                               
+JRNL        PMID   171582                                                       
+JRNL        DOI    10.1038/257751A0                                             
+REMARK   1                                                                      
+REMARK   1 REFERENCE 1                                                          
+REMARK   1  EDIT   M.O.DAYHOFF                                                  
+REMARK   1  REF    ATLAS OF PROTEIN SEQUENCE     V.   5   125 1976              
+REMARK   1  REF  2 AND STRUCTURE,SUPPLEMENT 2                                   
+REMARK   1  PUBL   NATIONAL BIOMEDICAL RESEARCH FOUNDATION, SILVER              
+REMARK   1  PUBL 2 SPRING,MD.                                                   
+REMARK   1  REFN                   ISSN 0-912466-05-7                           
+REMARK   2                                                                      
+REMARK   2 RESOLUTION.    3.00 ANGSTROMS.                                       
+REMARK   3                                                                      
+REMARK   3 REFINEMENT.                                                          
+REMARK   3   PROGRAM     : NULL                                                 
+REMARK   3   AUTHORS     : NULL                                                 
+REMARK   3                                                                      
+REMARK   3  DATA USED IN REFINEMENT.                                            
+REMARK   3   RESOLUTION RANGE HIGH (ANGSTROMS) : 3.00                           
+REMARK   3   RESOLUTION RANGE LOW  (ANGSTROMS) : NULL                           
+REMARK   3   DATA CUTOFF            (SIGMA(F)) : NULL                           
+REMARK   3   DATA CUTOFF HIGH         (ABS(F)) : NULL                           
+REMARK   3   DATA CUTOFF LOW          (ABS(F)) : NULL                           
+REMARK   3   COMPLETENESS (WORKING+TEST)   (%) : NULL                           
+REMARK   3   NUMBER OF REFLECTIONS             : NULL                           
+REMARK   3                                                                      
+REMARK   3  FIT TO DATA USED IN REFINEMENT.                                     
+REMARK   3   CROSS-VALIDATION METHOD          : NULL                            
+REMARK   3   FREE R VALUE TEST SET SELECTION  : NULL                            
+REMARK   3   R VALUE            (WORKING SET) : NULL                            
+REMARK   3   FREE R VALUE                     : NULL                            
+REMARK   3   FREE R VALUE TEST SET SIZE   (%) : NULL                            
+REMARK   3   FREE R VALUE TEST SET COUNT      : NULL                            
+REMARK   3   ESTIMATED ERROR OF FREE R VALUE  : NULL                            
+REMARK   3                                                                      
+REMARK   3  FIT IN THE HIGHEST RESOLUTION BIN.                                  
+REMARK   3   TOTAL NUMBER OF BINS USED           : NULL                         
+REMARK   3   BIN RESOLUTION RANGE HIGH       (A) : NULL                         
+REMARK   3   BIN RESOLUTION RANGE LOW        (A) : NULL                         
+REMARK   3   BIN COMPLETENESS (WORKING+TEST) (%) : NULL                         
+REMARK   3   REFLECTIONS IN BIN    (WORKING SET) : NULL                         
+REMARK   3   BIN R VALUE           (WORKING SET) : NULL                         
+REMARK   3   BIN FREE R VALUE                    : NULL                         
+REMARK   3   BIN FREE R VALUE TEST SET SIZE  (%) : NULL                         
+REMARK   3   BIN FREE R VALUE TEST SET COUNT     : NULL                         
+REMARK   3   ESTIMATED ERROR OF BIN FREE R VALUE : NULL                         
+REMARK   3                                                                      
+REMARK   3  NUMBER OF NON-HYDROGEN ATOMS USED IN REFINEMENT.                    
+REMARK   3   PROTEIN ATOMS            : 246                                     
+REMARK   3   NUCLEIC ACID ATOMS       : 0                                       
+REMARK   3   HETEROGEN ATOMS          : 0                                       
+REMARK   3   SOLVENT ATOMS            : 0                                       
+REMARK   3                                                                      
+REMARK   3  B VALUES.                                                           
+REMARK   3   FROM WILSON PLOT           (A**2) : NULL                           
+REMARK   3   MEAN B VALUE      (OVERALL, A**2) : NULL                           
+REMARK   3   OVERALL ANISOTROPIC B VALUE.                                       
+REMARK   3    B11 (A**2) : NULL                                                 
+REMARK   3    B22 (A**2) : NULL                                                 
+REMARK   3    B33 (A**2) : NULL                                                 
+REMARK   3    B12 (A**2) : NULL                                                 
+REMARK   3    B13 (A**2) : NULL                                                 
+REMARK   3    B23 (A**2) : NULL                                                 
+REMARK   3                                                                      
+REMARK   3  ESTIMATED COORDINATE ERROR.                                         
+REMARK   3   ESD FROM LUZZATI PLOT        (A) : NULL                            
+REMARK   3   ESD FROM SIGMAA              (A) : NULL                            
+REMARK   3   LOW RESOLUTION CUTOFF        (A) : NULL                            
+REMARK   3                                                                      
+REMARK   3  CROSS-VALIDATED ESTIMATED COORDINATE ERROR.                         
+REMARK   3   ESD FROM C-V LUZZATI PLOT    (A) : NULL                            
+REMARK   3   ESD FROM C-V SIGMAA          (A) : NULL                            
+REMARK   3                                                                      
+REMARK   3  RMS DEVIATIONS FROM IDEAL VALUES.                                   
+REMARK   3   BOND LENGTHS                 (A) : NULL                            
+REMARK   3   BOND ANGLES            (DEGREES) : NULL                            
+REMARK   3   DIHEDRAL ANGLES        (DEGREES) : NULL                            
+REMARK   3   IMPROPER ANGLES        (DEGREES) : NULL                            
+REMARK   3                                                                      
+REMARK   3  ISOTROPIC THERMAL MODEL : NULL                                      
+REMARK   3                                                                      
+REMARK   3  ISOTROPIC THERMAL FACTOR RESTRAINTS.    RMS    SIGMA                
+REMARK   3   MAIN-CHAIN BOND              (A**2) : NULL  ; NULL                 
+REMARK   3   MAIN-CHAIN ANGLE             (A**2) : NULL  ; NULL                 
+REMARK   3   SIDE-CHAIN BOND              (A**2) : NULL  ; NULL                 
+REMARK   3   SIDE-CHAIN ANGLE             (A**2) : NULL  ; NULL                 
+REMARK   3                                                                      
+REMARK   3  NCS MODEL : NULL                                                    
+REMARK   3                                                                      
+REMARK   3  NCS RESTRAINTS.                         RMS   SIGMA/WEIGHT          
+REMARK   3   GROUP  1  POSITIONAL            (A) : NULL  ; NULL                 
+REMARK   3   GROUP  1  B-FACTOR           (A**2) : NULL  ; NULL                 
+REMARK   3                                                                      
+REMARK   3  PARAMETER FILE  1  : NULL                                           
+REMARK   3  TOPOLOGY FILE  1   : NULL                                           
+REMARK   3                                                                      
+REMARK   3  OTHER REFINEMENT REMARKS: NULL                                      
+REMARK   4                                                                      
+REMARK   4 1GCN COMPLIES WITH FORMAT V. 3.15, 01-DEC-08                         
+REMARK 100                                                                      
+REMARK 100 THIS ENTRY HAS BEEN PROCESSED BY BNL.                                
+REMARK 200                                                                      
+REMARK 200 EXPERIMENTAL DETAILS                                                 
+REMARK 200  EXPERIMENT TYPE                : X-RAY DIFFRACTION                  
+REMARK 200  DATE OF DATA COLLECTION        : NULL                               
+REMARK 200  TEMPERATURE           (KELVIN) : NULL                               
+REMARK 200  PH                             : NULL                               
+REMARK 200  NUMBER OF CRYSTALS USED        : NULL                               
+REMARK 200                                                                      
+REMARK 200  SYNCHROTRON              (Y/N) : NULL                               
+REMARK 200  RADIATION SOURCE               : NULL                               
+REMARK 200  BEAMLINE                       : NULL                               
+REMARK 200  X-RAY GENERATOR MODEL          : NULL                               
+REMARK 200  MONOCHROMATIC OR LAUE    (M/L) : NULL                               
+REMARK 200  WAVELENGTH OR RANGE        (A) : NULL                               
+REMARK 200  MONOCHROMATOR                  : NULL                               
+REMARK 200  OPTICS                         : NULL                               
+REMARK 200                                                                      
+REMARK 200  DETECTOR TYPE                  : NULL                               
+REMARK 200  DETECTOR MANUFACTURER          : NULL                               
+REMARK 200  INTENSITY-INTEGRATION SOFTWARE : NULL                               
+REMARK 200  DATA SCALING SOFTWARE          : NULL                               
+REMARK 200                                                                      
+REMARK 200  NUMBER OF UNIQUE REFLECTIONS   : NULL                               
+REMARK 200  RESOLUTION RANGE HIGH      (A) : NULL                               
+REMARK 200  RESOLUTION RANGE LOW       (A) : NULL                               
+REMARK 200  REJECTION CRITERIA  (SIGMA(I)) : NULL                               
+REMARK 200                                                                      
+REMARK 200 OVERALL.                                                             
+REMARK 200  COMPLETENESS FOR RANGE     (%) : NULL                               
+REMARK 200  DATA REDUNDANCY                : NULL                               
+REMARK 200  R MERGE                    (I) : NULL                               
+REMARK 200  R SYM                      (I) : NULL                               
+REMARK 200  <I/SIGMA(I)> FOR THE DATA SET  : NULL                               
+REMARK 200                                                                      
+REMARK 200 IN THE HIGHEST RESOLUTION SHELL.                                     
+REMARK 200  HIGHEST RESOLUTION SHELL, RANGE HIGH (A) : NULL                     
+REMARK 200  HIGHEST RESOLUTION SHELL, RANGE LOW  (A) : NULL                     
+REMARK 200  COMPLETENESS FOR SHELL     (%) : NULL                               
+REMARK 200  DATA REDUNDANCY IN SHELL       : NULL                               
+REMARK 200  R MERGE FOR SHELL          (I) : NULL                               
+REMARK 200  R SYM FOR SHELL            (I) : NULL                               
+REMARK 200  <I/SIGMA(I)> FOR SHELL         : NULL                               
+REMARK 200                                                                      
+REMARK 200 DIFFRACTION PROTOCOL: NULL                                           
+REMARK 200 METHOD USED TO DETERMINE THE STRUCTURE: NULL                         
+REMARK 200 SOFTWARE USED: NULL                                                  
+REMARK 200 STARTING MODEL: NULL                                                 
+REMARK 200                                                                      
+REMARK 200 REMARK: NULL                                                         
+REMARK 280                                                                      
+REMARK 280 CRYSTAL                                                              
+REMARK 280 SOLVENT CONTENT, VS   (%): 50.74                                     
+REMARK 280 MATTHEWS COEFFICIENT, VM (ANGSTROMS**3/DA): 2.50                     
+REMARK 280                                                                      
+REMARK 280 CRYSTALLIZATION CONDITIONS: NULL                                     
+REMARK 290                                                                      
+REMARK 290 CRYSTALLOGRAPHIC SYMMETRY                                            
+REMARK 290 SYMMETRY OPERATORS FOR SPACE GROUP: P 21 3                           
+REMARK 290                                                                      
+REMARK 290      SYMOP   SYMMETRY                                                
+REMARK 290     NNNMMM   OPERATOR                                                
+REMARK 290       1555   X,Y,Z                                                   
+REMARK 290       2555   -X+1/2,-Y,Z+1/2                                         
+REMARK 290       3555   -X,Y+1/2,-Z+1/2                                         
+REMARK 290       4555   X+1/2,-Y+1/2,-Z                                         
+REMARK 290       5555   Z,X,Y                                                   
+REMARK 290       6555   Z+1/2,-X+1/2,-Y                                         
+REMARK 290       7555   -Z+1/2,-X,Y+1/2                                         
+REMARK 290       8555   -Z,X+1/2,-Y+1/2                                         
+REMARK 290       9555   Y,Z,X                                                   
+REMARK 290      10555   -Y,Z+1/2,-X+1/2                                         
+REMARK 290      11555   Y+1/2,-Z+1/2,-X                                         
+REMARK 290      12555   -Y+1/2,-Z,X+1/2                                         
+REMARK 290                                                                      
+REMARK 290     WHERE NNN -> OPERATOR NUMBER                                     
+REMARK 290           MMM -> TRANSLATION VECTOR                                  
+REMARK 290                                                                      
+REMARK 290 CRYSTALLOGRAPHIC SYMMETRY TRANSFORMATIONS                            
+REMARK 290 THE FOLLOWING TRANSFORMATIONS OPERATE ON THE ATOM/HETATM             
+REMARK 290 RECORDS IN THIS ENTRY TO PRODUCE CRYSTALLOGRAPHICALLY                
+REMARK 290 RELATED MOLECULES.                                                   
+REMARK 290   SMTRY1   1  1.000000  0.000000  0.000000        0.00000            
+REMARK 290   SMTRY2   1  0.000000  1.000000  0.000000        0.00000            
+REMARK 290   SMTRY3   1  0.000000  0.000000  1.000000        0.00000            
+REMARK 290   SMTRY1   2 -1.000000  0.000000  0.000000       23.55000            
+REMARK 290   SMTRY2   2  0.000000 -1.000000  0.000000        0.00000            
+REMARK 290   SMTRY3   2  0.000000  0.000000  1.000000       23.55000            
+REMARK 290   SMTRY1   3 -1.000000  0.000000  0.000000        0.00000            
+REMARK 290   SMTRY2   3  0.000000  1.000000  0.000000       23.55000            
+REMARK 290   SMTRY3   3  0.000000  0.000000 -1.000000       23.55000            
+REMARK 290   SMTRY1   4  1.000000  0.000000  0.000000       23.55000            
+REMARK 290   SMTRY2   4  0.000000 -1.000000  0.000000       23.55000            
+REMARK 290   SMTRY3   4  0.000000  0.000000 -1.000000        0.00000            
+REMARK 290   SMTRY1   5  0.000000  0.000000  1.000000        0.00000            
+REMARK 290   SMTRY2   5  1.000000  0.000000  0.000000        0.00000            
+REMARK 290   SMTRY3   5  0.000000  1.000000  0.000000        0.00000            
+REMARK 290   SMTRY1   6  0.000000  0.000000  1.000000       23.55000            
+REMARK 290   SMTRY2   6 -1.000000  0.000000  0.000000       23.55000            
+REMARK 290   SMTRY3   6  0.000000 -1.000000  0.000000        0.00000            
+REMARK 290   SMTRY1   7  0.000000  0.000000 -1.000000       23.55000            
+REMARK 290   SMTRY2   7 -1.000000  0.000000  0.000000        0.00000            
+REMARK 290   SMTRY3   7  0.000000  1.000000  0.000000       23.55000            
+REMARK 290   SMTRY1   8  0.000000  0.000000 -1.000000        0.00000            
+REMARK 290   SMTRY2   8  1.000000  0.000000  0.000000       23.55000            
+REMARK 290   SMTRY3   8  0.000000 -1.000000  0.000000       23.55000            
+REMARK 290   SMTRY1   9  0.000000  1.000000  0.000000        0.00000            
+REMARK 290   SMTRY2   9  0.000000  0.000000  1.000000        0.00000            
+REMARK 290   SMTRY3   9  1.000000  0.000000  0.000000        0.00000            
+REMARK 290   SMTRY1  10  0.000000 -1.000000  0.000000        0.00000            
+REMARK 290   SMTRY2  10  0.000000  0.000000  1.000000       23.55000            
+REMARK 290   SMTRY3  10 -1.000000  0.000000  0.000000       23.55000            
+REMARK 290   SMTRY1  11  0.000000  1.000000  0.000000       23.55000            
+REMARK 290   SMTRY2  11  0.000000  0.000000 -1.000000       23.55000            
+REMARK 290   SMTRY3  11 -1.000000  0.000000  0.000000        0.00000            
+REMARK 290   SMTRY1  12  0.000000 -1.000000  0.000000       23.55000            
+REMARK 290   SMTRY2  12  0.000000  0.000000 -1.000000        0.00000            
+REMARK 290   SMTRY3  12  1.000000  0.000000  0.000000       23.55000            
+REMARK 290                                                                      
+REMARK 290 REMARK: NULL                                                         
+REMARK 300                                                                      
+REMARK 300 BIOMOLECULE: 1                                                       
+REMARK 300 SEE REMARK 350 FOR THE AUTHOR PROVIDED AND/OR PROGRAM                
+REMARK 300 GENERATED ASSEMBLY INFORMATION FOR THE STRUCTURE IN                  
+REMARK 300 THIS ENTRY. THE REMARK MAY ALSO PROVIDE INFORMATION ON               
+REMARK 300 BURIED SURFACE AREA.                                                 
+REMARK 350                                                                      
+REMARK 350 COORDINATES FOR A COMPLETE MULTIMER REPRESENTING THE KNOWN           
+REMARK 350 BIOLOGICALLY SIGNIFICANT OLIGOMERIZATION STATE OF THE                
+REMARK 350 MOLECULE CAN BE GENERATED BY APPLYING BIOMT TRANSFORMATIONS          
+REMARK 350 GIVEN BELOW.  BOTH NON-CRYSTALLOGRAPHIC AND                          
+REMARK 350 CRYSTALLOGRAPHIC OPERATIONS ARE GIVEN.                               
+REMARK 350                                                                      
+REMARK 350 BIOMOLECULE: 1                                                       
+REMARK 350 AUTHOR DETERMINED BIOLOGICAL UNIT: MONOMERIC                         
+REMARK 350 APPLY THE FOLLOWING TO CHAINS: A                                     
+REMARK 350   BIOMT1   1  1.000000  0.000000  0.000000        0.00000            
+REMARK 350   BIOMT2   1  0.000000  1.000000  0.000000        0.00000            
+REMARK 350   BIOMT3   1  0.000000  0.000000  1.000000        0.00000            
+REMARK 500                                                                      
+REMARK 500 GEOMETRY AND STEREOCHEMISTRY                                         
+REMARK 500 SUBTOPIC: COVALENT BOND LENGTHS                                      
+REMARK 500                                                                      
+REMARK 500 THE STEREOCHEMICAL PARAMETERS OF THE FOLLOWING RESIDUES              
+REMARK 500 HAVE VALUES WHICH DEVIATE FROM EXPECTED VALUES BY MORE               
+REMARK 500 THAN 6*RMSD (M=MODEL NUMBER; RES=RESIDUE NAME; C=CHAIN               
+REMARK 500 IDENTIFIER; SSEQ=SEQUENCE NUMBER; I=INSERTION CODE).                 
+REMARK 500                                                                      
+REMARK 500 STANDARD TABLE:                                                      
+REMARK 500 FORMAT: (10X,I3,1X,2(A3,1X,A1,I4,A1,1X,A4,3X),1X,F6.3)               
+REMARK 500                                                                      
+REMARK 500 EXPECTED VALUES PROTEIN: ENGH AND HUBER, 1999                        
+REMARK 500 EXPECTED VALUES NUCLEIC ACID: CLOWNEY ET AL 1996                     
+REMARK 500                                                                      
+REMARK 500  M RES CSSEQI ATM1   RES CSSEQI ATM2   DEVIATION                     
+REMARK 500    TYR A  10   CZ    TYR A  10   OH     -0.387                       
+REMARK 500    TRP A  25   CD1   TRP A  25   NE1     0.287                       
+REMARK 500    TRP A  25   NE1   TRP A  25   CE2     0.109                       
+REMARK 500                                                                      
+REMARK 500 REMARK: NULL                                                         
+REMARK 500                                                                      
+REMARK 500 GEOMETRY AND STEREOCHEMISTRY                                         
+REMARK 500 SUBTOPIC: COVALENT BOND ANGLES                                       
+REMARK 500                                                                      
+REMARK 500 THE STEREOCHEMICAL PARAMETERS OF THE FOLLOWING RESIDUES              
+REMARK 500 HAVE VALUES WHICH DEVIATE FROM EXPECTED VALUES BY MORE               
+REMARK 500 THAN 6*RMSD (M=MODEL NUMBER; RES=RESIDUE NAME; C=CHAIN               
+REMARK 500 IDENTIFIER; SSEQ=SEQUENCE NUMBER; I=INSERTION CODE).                 
+REMARK 500                                                                      
+REMARK 500 STANDARD TABLE:                                                      
+REMARK 500 FORMAT: (10X,I3,1X,A3,1X,A1,I4,A1,3(1X,A4,2X),12X,F5.1)              
+REMARK 500                                                                      
+REMARK 500 EXPECTED VALUES PROTEIN: ENGH AND HUBER, 1999                        
+REMARK 500 EXPECTED VALUES NUCLEIC ACID: CLOWNEY ET AL 1996                     
+REMARK 500                                                                      
+REMARK 500  M RES CSSEQI ATM1   ATM2   ATM3                                     
+REMARK 500    TRP A  25   CG  -  CD1 -  NE1 ANGL. DEV. =   6.7 DEGREES          
+REMARK 500    TRP A  25   CD1 -  NE1 -  CE2 ANGL. DEV. = -21.5 DEGREES          
+REMARK 500    TRP A  25   NE1 -  CE2 -  CZ2 ANGL. DEV. = -11.0 DEGREES          
+REMARK 500    TRP A  25   NE1 -  CE2 -  CD2 ANGL. DEV. =   9.6 DEGREES          
+REMARK 500                                                                      
+REMARK 500 REMARK: NULL                                                         
+REMARK 500                                                                      
+REMARK 500 GEOMETRY AND STEREOCHEMISTRY                                         
+REMARK 500 SUBTOPIC: TORSION ANGLES                                             
+REMARK 500                                                                      
+REMARK 500 TORSION ANGLES OUTSIDE THE EXPECTED RAMACHANDRAN REGIONS:            
+REMARK 500 (M=MODEL NUMBER; RES=RESIDUE NAME; C=CHAIN IDENTIFIER;               
+REMARK 500 SSEQ=SEQUENCE NUMBER; I=INSERTION CODE).                             
+REMARK 500                                                                      
+REMARK 500 STANDARD TABLE:                                                      
+REMARK 500 FORMAT:(10X,I3,1X,A3,1X,A1,I4,A1,4X,F7.2,3X,F7.2)                    
+REMARK 500                                                                      
+REMARK 500 EXPECTED VALUES: GJ KLEYWEGT AND TA JONES (1996). PHI/PSI-           
+REMARK 500 CHOLOGY: RAMACHANDRAN REVISITED. STRUCTURE 4, 1395 - 1400            
+REMARK 500                                                                      
+REMARK 500  M RES CSSEQI        PSI       PHI                                   
+REMARK 500    SER A   2      -57.57    -21.14                                   
+REMARK 500    THR A   5       54.62    -63.85                                   
+REMARK 500    SER A  11        9.62    -51.97                                   
+REMARK 500    MET A  27      -93.98   -145.30                                   
+REMARK 500    ASN A  28       64.02     15.67                                   
+REMARK 500                                                                      
+REMARK 500 REMARK: NULL                                                         
+REMARK 500                                                                      
+REMARK 500 GEOMETRY AND STEREOCHEMISTRY                                         
+REMARK 500 SUBTOPIC: PLANAR GROUPS                                              
+REMARK 500                                                                      
+REMARK 500 PLANAR GROUPS IN THE FOLLOWING RESIDUES HAVE A TOTAL                 
+REMARK 500 RMS DISTANCE OF ALL ATOMS FROM THE BEST-FIT PLANE                    
+REMARK 500 BY MORE THAN AN EXPECTED VALUE OF 6*RMSD, WITH AN                    
+REMARK 500 RMSD 0.02 ANGSTROMS, OR AT LEAST ONE ATOM HAS                        
+REMARK 500 AN RMSD GREATER THAN THIS VALUE                                      
+REMARK 500 (M=MODEL NUMBER; RES=RESIDUE NAME; C=CHAIN IDENTIFIER;               
+REMARK 500 SSEQ=SEQUENCE NUMBER; I=INSERTION CODE).                             
+REMARK 500                                                                      
+REMARK 500  M RES CSSEQI        RMS     TYPE                                    
+REMARK 500    ASN A  28         0.08    SIDE_CHAIN                              
+REMARK 500                                                                      
+REMARK 500 REMARK: NULL                                                         
+REMARK 500                                                                      
+REMARK 500 GEOMETRY AND STEREOCHEMISTRY                                         
+REMARK 500 SUBTOPIC: MAIN CHAIN PLANARITY                                       
+REMARK 500                                                                      
+REMARK 500 THE FOLLOWING RESIDUES HAVE A PSEUDO PLANARITY                       
+REMARK 500 TORSION, C(I) - CA(I) - N(I+1) - O(I), GREATER                       
+REMARK 500 10.0 DEGREES. (M=MODEL NUMBER; RES=RESIDUE NAME;                     
+REMARK 500 C=CHAIN IDENTIFIER; SSEQ=SEQUENCE NUMBER;                            
+REMARK 500 I=INSERTION CODE).                                                   
+REMARK 500                                                                      
+REMARK 500  M RES CSSEQI        ANGLE                                           
+REMARK 500    HIS A   1         19.48                                           
+REMARK 500    GLN A   3        -15.78                                           
+REMARK 500    GLY A   4        -17.23                                           
+REMARK 500    THR A   5        -10.38                                           
+REMARK 500    PHE A   6        -12.06                                           
+REMARK 500    THR A   7        -14.66                                           
+REMARK 500    SER A  11        -15.10                                           
+REMARK 500    LYS A  12         14.46                                           
+REMARK 500    ALA A  19        -10.92                                           
+REMARK 500    GLN A  20        -13.40                                           
+REMARK 500    VAL A  23        -15.87                                           
+REMARK 500    LEU A  26        -14.56                                           
+REMARK 500    MET A  27        -16.22                                           
+REMARK 500                                                                      
+REMARK 500 REMARK: NULL                                                         
+DBREF  1GCN A    1    29  UNP    P01274   GLUC_PIG        33     61             
+SEQRES   1 A   29  HIS SER GLN GLY THR PHE THR SER ASP TYR SER LYS TYR          
+SEQRES   2 A   29  LEU ASP SER ARG ARG ALA GLN ASP PHE VAL GLN TRP LEU          
+SEQRES   3 A   29  MET ASN THR                                                  
+HELIX    1   A PHE A    6  LEU A   26  1                                  21    
+CRYST1   47.100   47.100   47.100  90.00  90.00  90.00 P 21 3       12          
+ORIGX1      0.021231  0.000000  0.000000        0.00000                         
+ORIGX2      0.000000  0.021231  0.000000        0.00000                         
+ORIGX3      0.000000  0.000000  0.021231        0.00000                         
+SCALE1      0.021231  0.000000  0.000000        0.00000                         
+SCALE2      0.000000  0.021231  0.000000        0.00000                         
+SCALE3      0.000000  0.000000  0.021231        0.00000                         
+ATOM      1  N   HIS A   1      49.668  24.248  10.436  1.00 25.00           N  
+ATOM      2  CA  HIS A   1      50.197  25.578  10.784  1.00 16.00           C  
+ATOM      3  C   HIS A   1      49.169  26.701  10.917  1.00 16.00           C  
+ATOM      4  O   HIS A   1      48.241  26.524  11.749  1.00 16.00           O  
+ATOM      5  CB  HIS A   1      51.312  26.048   9.843  1.00 16.00           C  
+ATOM      6  CG  HIS A   1      50.958  26.068   8.340  1.00 16.00           C  
+ATOM      7  ND1 HIS A   1      49.636  26.144   7.860  1.00 16.00           N  
+ATOM      8  CD2 HIS A   1      51.797  26.043   7.286  1.00 16.00           C  
+ATOM      9  CE1 HIS A   1      49.691  26.152   6.454  1.00 17.00           C  
+ATOM     10  NE2 HIS A   1      51.046  26.090   6.098  1.00 17.00           N  
+ATOM     11  N   SER A   2      49.788  27.850  10.784  1.00 16.00           N  
+ATOM     12  CA  SER A   2      49.138  29.147  10.620  1.00 15.00           C  
+ATOM     13  C   SER A   2      47.713  29.006  10.110  1.00 15.00           C  
+ATOM     14  O   SER A   2      46.740  29.251  10.864  1.00 15.00           O  
+ATOM     15  CB  SER A   2      49.875  29.930   9.569  1.00 16.00           C  
+ATOM     16  OG  SER A   2      49.145  31.057   9.176  1.00 19.00           O  
+ATOM     17  N   GLN A   3      47.620  28.367   8.973  1.00 15.00           N  
+ATOM     18  CA  GLN A   3      46.287  28.193   8.308  1.00 14.00           C  
+ATOM     19  C   GLN A   3      45.406  27.172   8.963  1.00 14.00           C  
+ATOM     20  O   GLN A   3      44.198  27.508   9.014  1.00 14.00           O  
+ATOM     21  CB  GLN A   3      46.489  27.963   6.806  1.00 18.00           C  
+ATOM     22  CG  GLN A   3      45.138  27.800   6.111  1.00 21.00           C  
+ATOM     23  CD  GLN A   3      45.304  27.952   4.603  1.00 24.00           C  
+ATOM     24  OE1 GLN A   3      46.432  28.202   4.112  1.00 24.00           O  
+ATOM     25  NE2 GLN A   3      44.233  27.647   3.897  1.00 26.00           N  
+ATOM     26  N   GLY A   4      46.014  26.394   9.871  1.00 14.00           N  
+ATOM     27  CA  GLY A   4      45.422  25.287  10.680  1.00 14.00           C  
+ATOM     28  C   GLY A   4      43.892  25.215  10.719  1.00 14.00           C  
+ATOM     29  O   GLY A   4      43.287  26.155  11.288  1.00 14.00           O  
+ATOM     30  N   THR A   5      43.406  23.993  10.767  1.00 14.00           N  
+ATOM     31  CA  THR A   5      42.004  23.642  10.443  1.00 12.00           C  
+ATOM     32  C   THR A   5      40.788  24.146  11.252  1.00 12.00           C  
+ATOM     33  O   THR A   5      39.804  23.384  11.410  1.00 12.00           O  
+ATOM     34  CB  THR A   5      41.934  22.202   9.889  1.00 14.00           C  
+ATOM     35  OG1 THR A   5      41.080  21.317  10.609  1.00 15.00           O  
+ATOM     36  CG2 THR A   5      43.317  21.556   9.849  1.00 15.00           C  
+ATOM     37  N   PHE A   6      40.628  25.463  11.441  1.00 12.00           N  
+ATOM     38  CA  PHE A   6      39.381  25.950  12.104  1.00 12.00           C  
+ATOM     39  C   PHE A   6      38.156  25.684  11.232  1.00 12.00           C  
+ATOM     40  O   PHE A   6      37.231  25.002  11.719  1.00 12.00           O  
+ATOM     41  CB  PHE A   6      39.407  27.425  12.584  1.00 12.00           C  
+ATOM     42  CG  PHE A   6      38.187  27.923  13.430  1.00 12.00           C  
+ATOM     43  CD1 PHE A   6      36.889  27.518  13.163  1.00 12.00           C  
+ATOM     44  CD2 PHE A   6      38.386  28.862  14.419  1.00 12.00           C  
+ATOM     45  CE1 PHE A   6      35.813  27.967  13.909  1.00 12.00           C  
+ATOM     46  CE2 PHE A   6      37.306  29.328  15.177  1.00 12.00           C  
+ATOM     47  CZ  PHE A   6      36.019  28.871  14.928  1.00 12.00           C  
+ATOM     48  N   THR A   7      38.341  25.794   9.956  1.00 12.00           N  
+ATOM     49  CA  THR A   7      37.249  25.666   8.991  1.00 12.00           C  
+ATOM     50  C   THR A   7      36.324  24.452   9.101  1.00 12.00           C  
+ATOM     51  O   THR A   7      35.111  24.637   9.387  1.00 12.00           O  
+ATOM     52  CB  THR A   7      37.884  25.743   7.628  1.00 13.00           C  
+ATOM     53  OG1 THR A   7      37.940  27.122   7.317  1.00 14.00           O  
+ATOM     54  CG2 THR A   7      37.073  25.003   6.585  1.00 14.00           C  
+ATOM     55  N   SER A   8      36.964  23.356   9.442  1.00 12.00           N  
+ATOM     56  CA  SER A   8      36.286  22.063   9.486  1.00 12.00           C  
+ATOM     57  C   SER A   8      35.575  21.813  10.813  1.00 11.00           C  
+ATOM     58  O   SER A   8      35.203  20.650  11.111  1.00 10.00           O  
+ATOM     59  CB  SER A   8      37.291  20.958   9.189  1.00 16.00           C  
+ATOM     60  OG  SER A   8      37.917  21.247   7.943  1.00 20.00           O  
+ATOM     61  N   ASP A   9      35.723  22.783  11.694  1.00 10.00           N  
+ATOM     62  CA  ASP A   9      35.004  22.803  12.977  1.00 10.00           C  
+ATOM     63  C   ASP A   9      33.532  23.121  12.749  1.00 10.00           C  
+ATOM     64  O   ASP A   9      32.645  22.360  13.210  1.00 10.00           O  
+ATOM     65  CB  ASP A   9      35.556  23.874  13.919  1.00 11.00           C  
+ATOM     66  CG  ASP A   9      36.280  23.230  15.096  1.00 13.00           C  
+ATOM     67  OD1 ASP A   9      36.088  22.010  15.324  1.00 16.00           O  
+ATOM     68  OD2 ASP A   9      36.821  23.974  15.951  1.00 16.00           O  
+ATOM     69  N   TYR A  10      33.316  24.220  12.040  1.00 10.00           N  
+ATOM     70  CA  TYR A  10      31.967  24.742  11.748  1.00 10.00           C  
+ATOM     71  C   TYR A  10      31.203  23.973  10.685  1.00 10.00           C  
+ATOM     72  O   TYR A  10      29.980  23.772  10.885  1.00 10.00           O  
+ATOM     73  CB  TYR A  10      31.951  26.230  11.367  1.00 10.00           C  
+ATOM     74  CG  TYR A  10      30.613  26.678  10.713  1.00 10.00           C  
+ATOM     75  CD1 TYR A  10      30.563  26.886   9.350  1.00 10.00           C  
+ATOM     76  CD2 TYR A  10      29.463  26.824  11.461  1.00 10.00           C  
+ATOM     77  CE1 TYR A  10      29.377  27.275   8.733  1.00 10.00           C  
+ATOM     78  CE2 TYR A  10      28.272  27.214  10.848  1.00 10.00           C  
+ATOM     79  CZ  TYR A  10      28.226  27.452   9.483  1.00 10.00           C  
+ATOM     80  OH  TYR A  10      27.365  27.683   9.060  1.00 11.00           O  
+ATOM     81  N   SER A  11      31.796  23.909   9.491  1.00 10.00           N  
+ATOM     82  CA  SER A  11      31.146  23.418   8.250  1.00 10.00           C  
+ATOM     83  C   SER A  11      30.463  22.048   8.303  1.00 10.00           C  
+ATOM     84  O   SER A  11      29.615  21.759   7.422  1.00 10.00           O  
+ATOM     85  CB  SER A  11      32.004  23.615   6.998  1.00 14.00           C  
+ATOM     86  OG  SER A  11      32.013  24.995   6.632  1.00 19.00           O  
+ATOM     87  N   LYS A  12      30.402  21.619   9.544  1.00 10.00           N  
+ATOM     88  CA  LYS A  12      29.792  20.460  10.189  1.00  9.00           C  
+ATOM     89  C   LYS A  12      28.494  20.817  10.932  1.00  9.00           C  
+ATOM     90  O   LYS A  12      27.597  19.943  10.980  1.00  9.00           O  
+ATOM     91  CB  LYS A  12      30.811  20.013  11.224  1.00 10.00           C  
+ATOM     92  CG  LYS A  12      30.482  18.661  11.833  1.00 14.00           C  
+ATOM     93  CD  LYS A  12      31.413  18.365  12.999  1.00 18.00           C  
+ATOM     94  CE  LYS A  12      31.243  16.937  13.498  1.00 22.00           C  
+ATOM     95  NZ  LYS A  12      32.121  16.717  14.652  1.00 26.00           N  
+ATOM     96  N   TYR A  13      28.583  21.742  11.894  1.00  9.00           N  
+ATOM     97  CA  TYR A  13      27.396  22.283  12.612  1.00  8.00           C  
+ATOM     98  C   TYR A  13      26.214  22.497  11.670  1.00  8.00           C  
+ATOM     99  O   TYR A  13      25.037  22.245  12.029  1.00  8.00           O  
+ATOM    100  CB  TYR A  13      27.730  23.578  13.385  1.00  8.00           C  
+ATOM    101  CG  TYR A  13      26.516  24.500  13.692  1.00  8.00           C  
+ATOM    102  CD1 TYR A  13      25.798  24.377  14.867  1.00  8.00           C  
+ATOM    103  CD2 TYR A  13      26.185  25.498  12.796  1.00  8.00           C  
+ATOM    104  CE1 TYR A  13      24.713  25.228  15.120  1.00  8.00           C  
+ATOM    105  CE2 TYR A  13      25.108  26.342  13.035  1.00  8.00           C  
+ATOM    106  CZ  TYR A  13      24.370  26.210  14.196  1.00  8.00           C  
+ATOM    107  OH  TYR A  13      23.202  26.933  14.347  1.00 10.00           O  
+ATOM    108  N   LEU A  14      26.522  22.993  10.494  1.00  8.00           N  
+ATOM    109  CA  LEU A  14      25.461  23.263   9.523  1.00  8.00           C  
+ATOM    110  C   LEU A  14      24.912  21.978   8.907  1.00  8.00           C  
+ATOM    111  O   LEU A  14      24.122  22.025   7.933  1.00  8.00           O  
+ATOM    112  CB  LEU A  14      25.923  24.242   8.447  1.00 13.00           C  
+ATOM    113  CG  LEU A  14      25.064  25.509   8.412  1.00 19.00           C  
+ATOM    114  CD1 LEU A  14      25.564  26.496   7.505  1.00 25.00           C  
+ATOM    115  CD2 LEU A  14      23.582  25.209   8.199  1.00 25.00           C  
+ATOM    116  N   ASP A  15      25.556  20.886   9.263  1.00  8.00           N  
+ATOM    117  CA  ASP A  15      25.075  19.552   8.885  1.00  8.00           C  
+ATOM    118  C   ASP A  15      24.208  19.002  10.009  1.00  8.00           C  
+ATOM    119  O   ASP A  15      23.550  17.940   9.861  1.00  8.00           O  
+ATOM    120  CB  ASP A  15      26.246  18.601   8.644  1.00 11.00           C  
+ATOM    121  CG  ASP A  15      26.260  18.121   7.196  1.00 16.00           C  
+ATOM    122  OD1 ASP A  15      26.021  18.946   6.280  1.00 21.00           O  
+ATOM    123  OD2 ASP A  15      26.732  16.984   6.946  1.00 21.00           O  
+ATOM    124  N   SER A  16      24.015  19.861  10.986  1.00  8.00           N  
+ATOM    125  CA  SER A  16      23.180  19.548  12.149  1.00  7.00           C  
+ATOM    126  C   SER A  16      21.923  20.414  12.167  1.00  7.00           C  
+ATOM    127  O   SER A  16      20.841  19.941  12.598  1.00  7.00           O  
+ATOM    128  CB  SER A  16      23.981  19.746  13.437  1.00  9.00           C  
+ATOM    129  OG  SER A  16      23.327  19.102  14.524  1.00 11.00           O  
+ATOM    130  N   ARG A  17      22.037  21.605  11.597  1.00  7.00           N  
+ATOM    131  CA  ARG A  17      20.875  22.504  11.583  1.00  6.00           C  
+ATOM    132  C   ARG A  17      19.868  22.156  10.491  1.00  6.00           C  
+ATOM    133  O   ARG A  17      18.665  22.015  10.809  1.00  6.00           O  
+ATOM    134  CB  ARG A  17      21.214  23.997  11.557  1.00  7.00           C  
+ATOM    135  CG  ARG A  17      20.010  24.800  12.063  1.00  9.00           C  
+ATOM    136  CD  ARG A  17      19.570  25.929  11.132  1.00 11.00           C  
+ATOM    137  NE  ARG A  17      20.149  27.218  11.537  1.00 12.00           N  
+ATOM    138  CZ  ARG A  17      19.828  28.351  10.936  1.00 13.00           C  
+ATOM    139  NH1 ARG A  17      19.319  28.304   9.720  1.00 14.00           N  
+ATOM    140  NH2 ARG A  17      20.351  29.485  11.362  1.00 14.00           N  
+ATOM    141  N   ARG A  18      20.378  21.725   9.348  1.00  6.00           N  
+ATOM    142  CA  ARG A  18      19.530  21.258   8.235  1.00  5.00           C  
+ATOM    143  C   ARG A  18      19.148  19.796   8.478  1.00  5.00           C  
+ATOM    144  O   ARG A  18      18.326  19.189   7.741  1.00  5.00           O  
+ATOM    145  CB  ARG A  18      20.237  21.481   6.888  1.00  8.00           C  
+ATOM    146  CG  ARG A  18      19.384  21.236   5.634  1.00  9.00           C  
+ATOM    147  CD  ARG A  18      19.623  19.860   5.005  1.00 11.00           C  
+ATOM    148  NE  ARG A  18      20.029  19.997   3.600  1.00 12.00           N  
+ATOM    149  CZ  ARG A  18      19.398  19.415   2.597  1.00 13.00           C  
+ATOM    150  NH1 ARG A  18      18.483  18.493   2.835  1.00 14.00           N  
+ATOM    151  NH2 ARG A  18      19.831  19.597   1.364  1.00 14.00           N  
+ATOM    152  N   ALA A  19      19.560  19.319   9.623  1.00  6.00           N  
+ATOM    153  CA  ALA A  19      19.126  17.991  10.053  1.00  6.00           C  
+ATOM    154  C   ALA A  19      18.002  18.136  11.071  1.00  6.00           C  
+ATOM    155  O   ALA A  19      16.933  17.494  10.922  1.00  7.00           O  
+ATOM    156  CB  ALA A  19      20.285  17.187  10.629  1.00 15.00           C  
+ATOM    157  N   GLN A  20      18.094  19.241  11.783  1.00  7.00           N  
+ATOM    158  CA  GLN A  20      17.013  19.632  12.689  1.00  7.00           C  
+ATOM    159  C   GLN A  20      15.897  20.314  11.905  1.00  7.00           C  
+ATOM    160  O   GLN A  20      14.701  20.031  12.162  1.00  7.00           O  
+ATOM    161  CB  GLN A  20      17.513  20.538  13.821  1.00 11.00           C  
+ATOM    162  CG  GLN A  20      16.699  21.829  13.936  1.00 16.00           C  
+ATOM    163  CD  GLN A  20      16.591  22.277  15.393  1.00 22.00           C  
+ATOM    164  OE1 GLN A  20      17.533  22.060  16.194  1.00 24.00           O  
+ATOM    165  NE2 GLN A  20      15.356  22.544  15.773  1.00 24.00           N  
+ATOM    166  N   ASP A  21      16.292  20.724  10.714  1.00  7.00           N  
+ATOM    167  CA  ASP A  21      15.405  21.490   9.835  1.00  7.00           C  
+ATOM    168  C   ASP A  21      14.451  20.565   9.120  1.00  7.00           C  
+ATOM    169  O   ASP A  21      13.245  20.850   8.962  1.00  7.00           O  
+ATOM    170  CB  ASP A  21      16.212  22.278   8.809  1.00 14.00           C  
+ATOM    171  CG  ASP A  21      15.427  23.525   8.413  1.00 21.00           C  
+ATOM    172  OD1 ASP A  21      15.031  24.298   9.321  1.00 28.00           O  
+ATOM    173  OD2 ASP A  21      15.316  23.827   7.200  1.00 28.00           O  
+ATOM    174  N   PHE A  22      14.987  19.373   8.843  1.00  7.00           N  
+ATOM    175  CA  PHE A  22      14.216  18.253   8.289  1.00  7.00           C  
+ATOM    176  C   PHE A  22      13.098  17.860   9.246  1.00  7.00           C  
+ATOM    177  O   PHE A  22      11.956  17.556   8.818  1.00  7.00           O  
+ATOM    178  CB  PHE A  22      15.134  17.038   8.105  1.00  8.00           C  
+ATOM    179  CG  PHE A  22      14.349  15.761   7.724  1.00 10.00           C  
+ATOM    180  CD1 PHE A  22      14.022  15.527   6.410  1.00 12.00           C  
+ATOM    181  CD2 PHE A  22      13.992  14.842   8.689  1.00 12.00           C  
+ATOM    182  CE1 PHE A  22      13.302  14.391   6.050  1.00 14.00           C  
+ATOM    183  CE2 PHE A  22      13.269  13.708   8.340  1.00 14.00           C  
+ATOM    184  CZ  PHE A  22      12.917  13.483   7.018  1.00 16.00           C  
+ATOM    185  N   VAL A  23      13.455  17.883  10.517  1.00  7.00           N  
+ATOM    186  CA  VAL A  23      12.574  17.403  11.589  1.00  7.00           C  
+ATOM    187  C   VAL A  23      11.283  18.205  11.729  1.00  7.00           C  
+ATOM    188  O   VAL A  23      10.233  17.600  12.052  1.00  7.00           O  
+ATOM    189  CB  VAL A  23      13.339  17.278  12.906  1.00 10.00           C  
+ATOM    190  CG1 VAL A  23      12.441  17.004  14.108  1.00 13.00           C  
+ATOM    191  CG2 VAL A  23      14.455  16.248  12.794  1.00 13.00           C  
+ATOM    192  N   GLN A  24      11.255  19.253  10.941  1.00  8.00           N  
+ATOM    193  CA  GLN A  24      10.082  20.114  10.818  1.00  8.00           C  
+ATOM    194  C   GLN A  24       9.158  19.638   9.692  1.00  8.00           C  
+ATOM    195  O   GLN A  24       7.959  19.990   9.663  1.00  8.00           O  
+ATOM    196  CB  GLN A  24      10.575  21.521  10.498  1.00 14.00           C  
+ATOM    197  CG  GLN A  24       9.505  22.591  10.661  1.00 20.00           C  
+ATOM    198  CD  GLN A  24       9.964  23.862   9.956  1.00 26.00           C  
+ATOM    199  OE1 GLN A  24      10.079  24.941  10.587  1.00 32.00           O  
+ATOM    200  NE2 GLN A  24      10.086  23.739   8.649  1.00 32.00           N  
+ATOM    201  N   TRP A  25       9.723  19.074   8.651  1.00  8.00           N  
+ATOM    202  CA  TRP A  25       8.899  18.676   7.495  1.00  9.00           C  
+ATOM    203  C   TRP A  25       8.118  17.395   7.751  1.00  9.00           C  
+ATOM    204  O   TRP A  25       6.860  17.395   7.725  1.00  9.00           O  
+ATOM    205  CB  TRP A  25       9.761  18.442   6.262  1.00 11.00           C  
+ATOM    206  CG  TRP A  25       8.871  18.331   5.004  1.00 12.00           C  
+ATOM    207  CD1 TRP A  25       8.097  19.279   4.442  1.00 12.00           C  
+ATOM    208  CD2 TRP A  25       8.640  17.180   4.249  1.00 12.00           C  
+ATOM    209  NE1 TRP A  25       7.041  18.780   3.259  1.00 12.00           N  
+ATOM    210  CE2 TRP A  25       7.873  17.564   3.121  1.00 12.00           C  
+ATOM    211  CE3 TRP A  25       9.124  15.884   4.378  1.00 12.00           C  
+ATOM    212  CZ2 TRP A  25       7.726  16.765   2.003  1.00 12.00           C  
+ATOM    213  CZ3 TRP A  25       8.870  15.038   3.296  1.00 12.00           C  
+ATOM    214  CH2 TRP A  25       8.216  15.469   2.140  1.00 12.00           C  
+ATOM    215  N   LEU A  26       8.857  16.484   8.346  1.00  9.00           N  
+ATOM    216  CA  LEU A  26       8.377  15.159   8.741  1.00 10.00           C  
+ATOM    217  C   LEU A  26       7.534  15.279  10.012  1.00 11.00           C  
+ATOM    218  O   LEU A  26       6.755  14.347  10.331  1.00 11.00           O  
+ATOM    219  CB  LEU A  26       9.611  14.267   8.924  1.00 10.00           C  
+ATOM    220  CG  LEU A  26       9.342  12.810   9.303  1.00 10.00           C  
+ATOM    221  CD1 LEU A  26       8.223  12.149   8.505  1.00 10.00           C  
+ATOM    222  CD2 LEU A  26      10.637  11.982   9.250  1.00 10.00           C  
+ATOM    223  N   MET A  27       7.281  16.544  10.320  1.00 11.00           N  
+ATOM    224  CA  MET A  27       6.446  16.959  11.451  1.00 11.00           C  
+ATOM    225  C   MET A  27       5.607  18.227  11.219  1.00 13.00           C  
+ATOM    226  O   MET A  27       4.823  18.240  10.244  1.00 13.00           O  
+ATOM    227  CB  MET A  27       7.327  17.118  12.679  1.00 11.00           C  
+ATOM    228  CG  MET A  27       6.518  17.289  13.953  1.00 11.00           C  
+ATOM    229  SD  MET A  27       7.301  18.326  15.196  1.00 11.00           S  
+ATOM    230  CE  MET A  27       5.833  18.677  16.178  1.00 11.00           C  
+ATOM    231  N   ASN A  28       6.147  19.366  11.620  1.00 14.00           N  
+ATOM    232  CA  ASN A  28       5.399  20.637  11.728  1.00 14.00           C  
+ATOM    233  C   ASN A  28       3.878  20.587  11.716  1.00 17.00           C  
+ATOM    234  O   ASN A  28       3.252  21.114  10.763  1.00 19.00           O  
+ATOM    235  CB  ASN A  28       5.874  21.774  10.843  1.00 14.00           C  
+ATOM    236  CG  ASN A  28       6.246  22.905  11.791  1.00 14.00           C  
+ATOM    237  OD1 ASN A  28       6.929  22.629  12.807  1.00 14.00           O  
+ATOM    238  ND2 ASN A  28       6.271  24.085  11.229  1.00 14.00           N  
+ATOM    239  N   THR A  29       3.391  19.940  12.762  1.00 21.00           N  
+ATOM    240  CA  THR A  29       2.014  19.761  13.283  1.00 21.00           C  
+ATOM    241  C   THR A  29       0.826  19.943  12.332  1.00 23.00           C  
+ATOM    242  O   THR A  29       0.932  19.600  11.133  1.00 30.00           O  
+ATOM    243  CB  THR A  29       1.845  20.667  14.505  1.00 21.00           C  
+ATOM    244  OG1 THR A  29       1.214  21.893  14.153  1.00 21.00           O  
+ATOM    245  CG2 THR A  29       3.180  20.968  15.185  1.00 21.00           C  
+ATOM    246  OXT THR A  29      -0.317  20.109  12.824  1.00 25.00           O  
+TER     247      THR A  29                                                      
+MASTER      344    1    0    1    0    0    0    6  246    1    0    3          
+END                                                                             
--- a/alphafold/data/mmcif_parsing.py
+++ b/alphafold/data/mmcif_parsing.py
@@ -315,6 +315,7 @@ def _get_header(parsed_info: MmCIFDict) -> PdbHeader:
      try:
        raw_resolution = parsed_info[res_key][0]
        header['resolution'] = float(raw_resolution)
+        break
      except ValueError:
        logging.debug('Invalid resolution format: %s', parsed_info[res_key])
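
Note on the hunk above: the added `break` makes the loop stop at the first resolution entry that parses, so a later key can no longer overwrite it. A minimal sketch of the effect, assuming the loop walks several candidate keys in priority order. The key names, the `0.0` default, and the `first_valid_resolution` helper are hypothetical and not taken from the hunk.

```python
from typing import Dict, List

# Hypothetical key names in priority order; the hunk does not show the real ones.
RESOLUTION_KEYS = ('key_a', 'key_b', 'key_c')


def first_valid_resolution(parsed_info: Dict[str, List[str]],
                           use_break: bool = True) -> float:
  """Returns a parsed resolution, or 0.0 (assumed default) if none parses."""
  resolution = 0.0
  for res_key in RESOLUTION_KEYS:
    if res_key in parsed_info:
      try:
        resolution = float(parsed_info[res_key][0])
        if use_break:
          break  # Stop at the first valid value, as in the patched code.
      except ValueError:
        continue  # Unparseable entry such as '?'; try the next key.
  return resolution


info = {'key_a': ['1.8'], 'key_b': ['?'], 'key_c': ['3.0']}
with_break = first_valid_resolution(info, use_break=True)
without_break = first_valid_resolution(info, use_break=False)
```

With `use_break=False` the value from the last parseable key wins, which is the behavior the patch removes.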

--- a/alphafold/data/msa_pairing.py
+++ b/alphafold/data/msa_pairing.py
@@ -15,9 +15,7 @@
 """Pairing logic for multimer data pipeline."""
 import collections
-import functools
+from typing import cast, Dict, Iterable, List, Sequence
-import string
-from typing import Any, Dict, Iterable, List, Sequence
 from alphafold.common import residue_constants
 from alphafold.data import pipeline
@@ -135,7 +133,7 @@ def _create_species_dict(msa_df: pd.DataFrame) -> Dict[bytes, pd.DataFrame]:
  """Creates mapping from species to msa dataframe of that species."""
  species_lookup = {}
  for species, species_df in msa_df.groupby('msa_species_identifiers'):
-    species_lookup[species] = species_df
+    species_lookup[cast(bytes, species)] = species_df
  return species_lookup

--- a/alphafold/data/templates.py
+++ b/alphafold/data/templates.py
@@ -449,6 +449,7 @@ def _get_atom_positions(
    mask = np.zeros([residue_constants.atom_type_num], dtype=np.float32)
    res_at_position = mmcif_object.seqres_to_structure[auth_chain_id][res_index]
    if not res_at_position.is_missing:
+      assert res_at_position.position is not None
      res = chain[(res_at_position.hetflag,
                   res_at_position.position.residue_number,
                   res_at_position.position.insertion_code)]

--- a/alphafold/model/folding_multimer.py
+++ b/alphafold/model/folding_multimer.py
@@ -775,7 +775,7 @@ def compute_atom14_gt(
  gt_mask = (1. - use_alt) * gt_mask + use_alt * alt_gt_mask
  gt_positions = (1. - use_alt) * gt_positions + use_alt * alt_gt_positions
-  return gt_positions, alt_gt_mask, alt_naming_is_better
+  return gt_positions, gt_mask, alt_naming_is_better
 def backbone_loss(gt_rigid: geometry.Rigid3Array,
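
Note on the hunk above: the function blends `gt_mask` and `alt_gt_mask` per residue, but the old return statement handed back the unblended `alt_gt_mask`. A small NumPy sketch of why the two differ. The toy shapes and values, and the construction of `use_alt` as a per-residue column, are assumptions made for illustration.

```python
import numpy as np

# Toy values: two residues, three atom slots each (shapes are illustrative).
gt_mask = np.array([[1., 1., 0.], [1., 0., 0.]])
alt_gt_mask = np.array([[1., 0., 1.], [0., 1., 0.]])
# Assumed per-residue flag: 1.0 where the alternative naming fits better.
use_alt = np.array([1., 0.])[:, None]

# Same blend as in the hunk: take the alt row where use_alt is 1, else the gt row.
blended_mask = (1. - use_alt) * gt_mask + use_alt * alt_gt_mask
```

Here the first residue takes its row from `alt_gt_mask` and the second keeps its row from `gt_mask`, so returning `alt_gt_mask` would give the wrong mask for the second residue.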

--- a/alphafold/model/geometry/test_utils.py
+++ b/alphafold/model/geometry/test_utils.py
@@ -61,9 +61,9 @@ def assert_vectors_equal(vec1: vector.Vec3Array, vec2: vector.Vec3Array):
 def assert_vectors_close(vec1: vector.Vec3Array, vec2: vector.Vec3Array):
-  np.testing.assert_allclose(vec1.x, vec2.x, atol=1e-6, rtol=0.)
+  np.testing.assert_allclose(vec1.x, vec2.x, atol=1e-5, rtol=0.)
-  np.testing.assert_allclose(vec1.y, vec2.y, atol=1e-6, rtol=0.)
+  np.testing.assert_allclose(vec1.y, vec2.y, atol=1e-5, rtol=0.)
-  np.testing.assert_allclose(vec1.z, vec2.z, atol=1e-6, rtol=0.)
+  np.testing.assert_allclose(vec1.z, vec2.z, atol=1e-5, rtol=0.)
 def assert_array_close_to_vector(array: jnp.ndarray, vec: vector.Vec3Array):

--- a/alphafold/model/prng_test.py
+++ b/alphafold/model/prng_test.py
@@ -29,8 +29,7 @@ class PrngTest(absltest.TestCase):
    raw_key = safe_key.get()
-    self.assertNotEqual(raw_key[0], init_key[0])
+    self.assertFalse((raw_key == init_key).all())
-    self.assertNotEqual(raw_key[1], init_key[1])
    with self.assertRaises(RuntimeError):
      safe_key.get()
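
Note on the hunk above: the old assertions required every word of the key to differ, while the new one only requires the keys to differ as a whole. The sketch below uses plain NumPy `uint32` arrays as a stand-in for raw key data. The values are made up, and the motivation (presumably compatibility with newer JAX key representations) is an assumption, not something stated in the diff.

```python
import numpy as np

# Made-up two-word keys that share their first word.
init_key = np.array([0, 42], dtype=np.uint32)
raw_key = np.array([0, 7], dtype=np.uint32)

# Old check: both words must differ individually.
elementwise_differs = bool(raw_key[0] != init_key[0]
                           and raw_key[1] != init_key[1])
# New check: the keys must not be identical in every word.
whole_key_differs = not (raw_key == init_key).all()
```

The two keys are clearly different, yet the word-by-word check rejects them; the whole-key comparison does not depend on how many words happen to coincide.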

--- a/alphafold/model/utils.py
+++ b/alphafold/model/utils.py
@@ -160,8 +160,14 @@ def padding_consistent_rng(f):
    return jax.vmap(functools.partial(grid_keys, shape=shape[1:]))(new_keys)
  def inner(key, shape, **kwargs):
+    keys = grid_keys(key, shape)
+    signature = (
+        '()->()'
+        if jax.dtypes.issubdtype(keys.dtype, jax.dtypes.prng_key)
+        else '(2)->()'
+    )
    return jnp.vectorize(
-        lambda key: f(key, shape=(), **kwargs),
+        functools.partial(f, shape=(), **kwargs), signature=signature
-        signature='(2)->()')(
+    )(keys)
-            grid_keys(key, shape))
  return inner
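
Note on the hunk above: the `signature` string tells the vectorizer how many trailing dimensions each call consumes. `'(2)->()'` means one length-2 key per call, while `'()->()'` treats each key as a scalar element. The sketch below uses `np.vectorize` as a stand-in for `jnp.vectorize`, and `sample` is a hypothetical placeholder for the wrapped sampler `f`.

```python
import numpy as np


def sample(key):
  # Hypothetical sampler: consumes one 2-word key and returns a scalar.
  return int(key[0]) * 10 + int(key[1])


# A 2x3 grid of keys, each stored as two uint32 words in the last axis.
keys = np.arange(12, dtype=np.uint32).reshape(2, 3, 2)

# '(2)->()' maps over the leading axes and passes each length-2 key to sample.
out = np.vectorize(sample, signature='(2)->()')(keys)
```

The output has the grid shape `(2, 3)`: the trailing key dimension is consumed by the function. If the keys were scalar elements instead, the same mapping would need `'()->()'`, which is the case the patched code selects by inspecting `keys.dtype`.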
--- a/alphafold/notebooks/notebook_utils.py
+++ b/alphafold/notebooks/notebook_utils.py
@@ -13,7 +13,6 @@
 # limitations under the License.
 """Helper methods for the AlphaFold Colab notebook."""
-import json
 from typing import AbstractSet, Any, Mapping, Optional, Sequence
 from alphafold.common import residue_constants
@@ -143,31 +142,6 @@ def empty_placeholder_template_features(
  }
-def get_pae_json(pae: np.ndarray, max_pae: float) -> str:
-  """Returns the PAE in the same format as is used in the AFDB.
-  Note that the values are presented as floats to 1 decimal place,
-  whereas AFDB returns integer values.
-  Args:
-    pae: The n_res x n_res PAE array.
-    max_pae: The maximum possible PAE value.
-  Returns:
-    PAE output format as a JSON string.
-  """
-  # Check the PAE array is the correct shape.
-  if (pae.ndim != 2 or pae.shape[0] != pae.shape[1]):
-    raise ValueError(f'PAE must be a square matrix, got {pae.shape}')
-  # Round the predicted aligned errors to 1 decimal place.
-  rounded_errors = np.round(pae.astype(np.float64), decimals=1)
-  formatted_output = [{
-      'predicted_aligned_error': rounded_errors.tolist(),
-      'max_predicted_aligned_error': max_pae
-  }]
-  return json.dumps(formatted_output, indent=None, separators=(',', ':'))
 def check_cell_execution_order(
    cells_ran: AbstractSet[int], cell_number: int) -> None:
  """Check that the cell execution order is correct.