Commit 15cd3506 authored by mashun1

Merge branch 'dtk24.04.1'

parents 24e633dc 19085464
*.egg*
tryme.ipynb
build/
dist/
test/
temp_output/
temp_fasta/
*pycache*
FROM image.sourcefind.cn:5000/dcu/admin/base/jax:0.4.23-ubuntu20.04-dtk24.04.1-py3.10
# RUN apt update
# WORKDIR /app
# WORKDIR /app/softwares
# RUN git clone https://github.com/soedinglab/hh-suite.git
# RUN mkdir -p hh-suite/build && cd hh-suite/build && cmake -DCMAKE_INSTALL_PREFIX=. .. && make -j 4 && make install
# ENV PATH=/app/softwares/hh-suite/build/bin:/app/softwares/hh-suite/build/scripts:$PATH
# WORKDIR /app/softwares
# RUN wget https://github.com/TimoLassmann/kalign/archive/refs/tags/v3.4.0.zip && unzip v3.4.0.zip && cd kalign-3.4.0 && mkdir build && cd build && cmake .. && make && make install
# WORKDIR /app/softwares
# RUN sudo apt install doxygen -y
# RUN wget https://github.com/openmm/openmm/archive/refs/tags/8.0.0.zip && unzip 8.0.0.zip && cd openmm-8.0.0 && mkdir build && cd build && cmake .. && make && sudo make install && sudo make PythonInstall
# WORKDIR /app/softwares
# RUN wget https://github.com/openmm/pdbfixer/archive/refs/tags/1.9.zip && unzip 1.9.zip && cd pdbfixer-1.9 && python setup.py install
# RUN sudo apt install hmmer -y
# WORKDIR /app
# COPY . /app/alphafold2
# RUN ls
# RUN pip install --no-cache-dir -r /app/alphafold2/requirements_dcu.txt -i https://mirrors.ustc.edu.cn/pypi/web/simple
# RUN pip install dm-haiku==0.0.11 flax==0.7.1 jmp==0.0.2 tabulate==0.8.9 --no-deps jax -i https://mirrors.ustc.edu.cn/pypi/web/simple
# RUN pip install orbax==0.1.6 orbax-checkpoint==0.1.6 optax==0.2.2 -i https://mirrors.ustc.edu.cn/pypi/web/simple
# WORKDIR /app/alphafold2
# RUN python setup.py install
<!--
* @Author: zhuww
* @email: zhuww@sugon.com
* @Date: 2023-04-06 18:04:07
* @LastEditTime: 2023-12-26 15:54:01
-->
# AF2
## Paper
- [https://www.nature.com/articles/s41586-021-03819-2](https://www.nature.com/articles/s41586-021-03819-2)
...@@ -19,9 +14,17 @@ AlphaFold2 extracts information from protein sequence and structure data, using neural
![img](./docs/alphafold2_1.png)
## Environment setup
### Docker (method 1)
# With this method you do not need to download this repository; the image already contains runnable code, but the corresponding data files must be mounted
docker pull image.sourcefind.cn:5000/dcu/admin/base/custom:alphafold2-dtk24.04.1-py310
docker run --shm-size 100g --network=host --name=alphafold2 --privileged --device=/dev/kfd --device=/dev/dri --group-add video --cap-add=SYS_PTRACE --security-opt seccomp=unconfined -v <local data path>:<container data path> -v /opt/hyhal:/opt/hyhal:ro -it <your IMAGE ID> bash
### Docker (method 2)
docker pull image.sourcefind.cn:5000/dcu/admin/base/jax:0.4.23-ubuntu20.04-dtk24.04.1-py3.10
...@@ -45,7 +48,7 @@ AlphaFold2 extracts information from protein sequence and structure data, using neural
export PATH="$(pwd)/bin:$(pwd)/scripts:$PATH"
wget https://github.com/TimoLassmann/kalign/archive/refs/tags/v3.4.0.zip
unzip v3.4.0.zip && cd kalign-3.4.0
mkdir build
cd build
cmake ..
...@@ -65,23 +68,8 @@ AlphaFold2 extracts information from protein sequence and structure data, using neural
wget https://github.com/openmm/pdbfixer/archive/refs/tags/1.9.zip
unzip 1.9.zip && cd pdbfixer-1.9 && python setup.py install
## Environment setup
A Docker image for inference can be pulled from [光源 (SourceFind)](https://www.sourcefind.cn/#/image/dcu/custom):
```
docker pull image.sourcefind.cn:5000/dcu/admin/base/custom:alphafold2-2.3.2-dtk23.10-py38
# Replace <Image ID> with the ID of the Docker image pulled above
# <Host Path>: path on the host
# <Container Path>: mapped path inside the container
docker run -it --name alphafold --privileged --shm-size=32G --device=/dev/kfd --device=/dev/dri/ --cap-add=SYS_PTRACE --security-opt seccomp=unconfined --ulimit memlock=-1:-1 --ipc=host --network host --group-add video -v <Host Path>:<Container Path> <Image ID> /bin/bash
```
Image version dependencies:
* DTK driver: dtk23.10
* Jax: 0.3.25
* TensorFlow2: 2.11.0
* python: python3.8
## Dataset
We recommend the open-source datasets used by AlphaFold2, including BFD, MGnify, PDB70, Uniclust, and Uniref90. The datasets total about 2.62 TB. The dataset layout is as follows:
...@@ -171,12 +159,12 @@ $DOWNLOAD_DIR/
``` ```
[View the protein 3D structure](https://www.pdbus.org/3d-view)
<div style="display: flex; justify-content: center; align-items: center;">
<img src="./docs/result_pdb.png" alt="Image">
<div style="position: absolute; top: 50%; left: 50%; transform: translate(-50%, -50%); background: rgba(0, 0, 0, 0.5); color: #fff; padding: 10px;">
Red is the true structure; the other color is the predicted structure
</div>
</div>
ID: 8U23
Blue is the predicted structure; the other color is the true structure
![alt text](image.png)
### Accuracy
Test data: [casp15](https://www.predictioncenter.org/casp15/targetlist.cgi), [uniprot](https://www.uniprot.org/)
...@@ -196,6 +184,8 @@ $DOWNLOAD_DIR/
| fp32 | monomer | T1024 | 408 | 0.664 | 0.470 | 87.076 | 0.829 | 0.518 | 3.516 |
| fp32 | multimer | H1106 | 236 | 0.203 | 0.144 | 0.860 | 0.181 | 0.151 | 20.457 |
## Application scenarios
### Algorithm category
......
![header](imgs/header.jpg)
# AlphaFold
This package provides an implementation of the inference pipeline of AlphaFold
v2. For simplicity, we refer to this model as AlphaFold throughout the rest of
this document.
We also provide:
1. An implementation of AlphaFold-Multimer. This represents a work in progress
and AlphaFold-Multimer isn't expected to be as stable as our monomer
AlphaFold system. [Read the guide](#updating-existing-installation) for how
to upgrade and update code.
2. The [technical note](docs/technical_note_v2.3.0.md) containing the models
and inference procedure for an updated AlphaFold v2.3.0.
3. A [CASP15 baseline](docs/casp15_predictions.zip) set of predictions along
with documentation of any manual interventions performed.
Any publication that discloses findings arising from using this source code or
the model parameters should [cite](#citing-this-work) the
[AlphaFold paper](https://doi.org/10.1038/s41586-021-03819-2) and, if
applicable, the
[AlphaFold-Multimer paper](https://www.biorxiv.org/content/10.1101/2021.10.04.463034v1).
Please also refer to the
[Supplementary Information](https://static-content.springer.com/esm/art%3A10.1038%2Fs41586-021-03819-2/MediaObjects/41586_2021_3819_MOESM1_ESM.pdf)
for a detailed description of the method.
**You can use a slightly simplified version of AlphaFold with
[this Colab notebook](https://colab.research.google.com/github/deepmind/alphafold/blob/main/notebooks/AlphaFold.ipynb)**
or community-supported versions (see below).
If you have any questions, please contact the AlphaFold team at
[alphafold@deepmind.com](mailto:alphafold@deepmind.com).
![CASP14 predictions](imgs/casp14_predictions.gif)
## Installation and running your first prediction
You will need a machine running Linux; AlphaFold does not support other
operating systems. Full installation requires up to 3 TB of disk space to keep
genetic databases (SSD storage is recommended) and a modern NVIDIA GPU (GPUs
with more memory can predict larger protein structures).
Please follow these steps:
1. Install [Docker](https://www.docker.com/).
* Install
[NVIDIA Container Toolkit](https://docs.nvidia.com/datacenter/cloud-native/container-toolkit/install-guide.html)
for GPU support.
* Set up running
[Docker as a non-root user](https://docs.docker.com/engine/install/linux-postinstall/#manage-docker-as-a-non-root-user).
1. Clone this repository and `cd` into it.
```bash
git clone https://github.com/deepmind/alphafold.git
cd ./alphafold
```
1. Download genetic databases and model parameters:
* Install `aria2c`. On most Linux distributions it is available via the
package manager as the `aria2` package (on Debian-based distributions this
can be installed by running `sudo apt install aria2`).
* Please use the script `scripts/download_all_data.sh` to download
and set up full databases. This may take substantial time (download size is
556 GB), so we recommend running this script in the background:
```bash
scripts/download_all_data.sh <DOWNLOAD_DIR> > download.log 2> download_all.log &
```
* **Note: The download directory `<DOWNLOAD_DIR>` should *not* be a
subdirectory in the AlphaFold repository directory.** If it is, the Docker
build will be slow as the large databases will be copied into the docker
build context.
* It is possible to run AlphaFold with reduced databases; please refer to
the [complete documentation](#genetic-databases).
1. Check that AlphaFold will be able to use a GPU by running:
```bash
docker run --rm --gpus all nvidia/cuda:11.0-base nvidia-smi
```
The output of this command should show a list of your GPUs. If it doesn't,
check if you followed all steps correctly when setting up the
[NVIDIA Container Toolkit](https://docs.nvidia.com/datacenter/cloud-native/container-toolkit/install-guide.html)
or take a look at the following
[NVIDIA Docker issue](https://github.com/NVIDIA/nvidia-docker/issues/1447#issuecomment-801479573).
If you wish to run AlphaFold using Singularity (a common containerization
platform on HPC systems) we recommend using some of the third party Singularity
setups as linked in https://github.com/deepmind/alphafold/issues/10 or
https://github.com/deepmind/alphafold/issues/24.
1. Build the Docker image:
```bash
docker build -f docker/Dockerfile -t alphafold .
```
If you encounter the following error:
```
W: GPG error: https://developer.download.nvidia.com/compute/cuda/repos/ubuntu1804/x86_64 InRelease: The following signatures couldn't be verified because the public key is not available: NO_PUBKEY A4B469963BF863CC
E: The repository 'https://developer.download.nvidia.com/compute/cuda/repos/ubuntu1804/x86_64 InRelease' is not signed.
```
use the workaround described in
https://github.com/deepmind/alphafold/issues/463#issuecomment-1124881779.
1. Install the `run_docker.py` dependencies. Note: You may optionally wish to
create a
[Python Virtual Environment](https://docs.python.org/3/tutorial/venv.html)
to prevent conflicts with your system's Python environment.
```bash
pip3 install -r docker/requirements.txt
```
1. Make sure that the output directory exists (the default is `/tmp/alphafold`)
and that you have sufficient permissions to write into it.
1. Run `run_docker.py` pointing to a FASTA file containing the protein
sequence(s) for which you wish to predict the structure (`--fasta_paths`
parameter). AlphaFold will search for the available templates before the
date specified by the `--max_template_date` parameter; this could be used to
avoid certain templates during modeling. `--data_dir` is the directory with
downloaded genetic databases and `--output_dir` is the absolute path to the
output directory.
```bash
python3 docker/run_docker.py \
--fasta_paths=your_protein.fasta \
--max_template_date=2022-01-01 \
--data_dir=$DOWNLOAD_DIR \
--output_dir=/home/user/absolute_path_to_the_output_dir
```
1. Once the run is over, the output directory shall contain predicted
structures of the target protein. Please check the documentation below for
additional options and troubleshooting tips.
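Two of the steps above are easy to get wrong before a long run: placing `<DOWNLOAD_DIR>` inside the repository, and pointing `--output_dir` at a directory that is missing or not writable. The following is a minimal sketch of a pre-flight check, not part of AlphaFold itself; the function names `is_inside` and `preflight` are hypothetical.

```python
import os
from pathlib import Path


def is_inside(child, parent):
    """Return True if `child` is `parent` itself or lies anywhere beneath it."""
    child_path = Path(child).resolve()
    parent_path = Path(parent).resolve()
    try:
        child_path.relative_to(parent_path)
    except ValueError:
        return False
    return True


def preflight(repo_dir, download_dir, output_dir):
    """Collect human-readable problems; an empty list means the checks passed."""
    problems = []
    if is_inside(download_dir, repo_dir):
        # The databases would be copied into the Docker build context.
        problems.append("download dir is inside the repository")
    if not os.path.isdir(output_dir):
        problems.append("output dir does not exist")
    elif not os.access(output_dir, os.W_OK):
        problems.append("output dir is not writable")
    return problems
```

Calling `preflight("./alphafold", "/data/af_download", "/tmp/alphafold")` before building the image returns a list of problems to fix; the paths here are placeholders.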
### Genetic databases
This step requires `aria2c` to be installed on your machine.
AlphaFold needs multiple genetic (sequence) databases to run:
* [BFD](https://bfd.mmseqs.com/),
* [MGnify](https://www.ebi.ac.uk/metagenomics/),
* [PDB70](http://wwwuser.gwdg.de/~compbiol/data/hhsuite/databases/hhsuite_dbs/),
* [PDB](https://www.rcsb.org/) (structures in the mmCIF format),
* [PDB seqres](https://www.rcsb.org/) – only for AlphaFold-Multimer,
* [UniRef30 (FKA UniClust30)](https://uniclust.mmseqs.com/),
* [UniProt](https://www.uniprot.org/uniprot/) – only for AlphaFold-Multimer,
* [UniRef90](https://www.uniprot.org/help/uniref).
We provide a script `scripts/download_all_data.sh` that can be used to download
and set up all of these databases:
* Recommended default:
```bash
scripts/download_all_data.sh <DOWNLOAD_DIR>
```
will download the full databases.
* With `reduced_dbs` parameter:
```bash
scripts/download_all_data.sh <DOWNLOAD_DIR> reduced_dbs
```
will download a reduced version of the databases to be used with the
`reduced_dbs` database preset. This shall be used with the corresponding
AlphaFold parameter `--db_preset=reduced_dbs` later during the AlphaFold run
(please see [AlphaFold parameters](#running-alphafold) section).
:ledger: **Note: The download directory `<DOWNLOAD_DIR>` should *not* be a
subdirectory in the AlphaFold repository directory.** If it is, the Docker build
will be slow as the large databases will be copied during the image creation.
We don't provide exactly the database versions used in CASP14 – see the
[note on reproducibility](#note-on-casp14-reproducibility). Some of the
databases are mirrored for speed, see [mirrored databases](#mirrored-databases).
:ledger: **Note: The total download size for the full databases is around 556 GB
and the total size when unzipped is 2.62 TB. Please make sure you have a large
enough hard drive space, bandwidth and time to download. We recommend using an
SSD for better genetic search performance.**
:ledger: **Note: If the download directory and datasets don't have full read and
write permissions, it can cause errors with the MSA tools, with opaque
(external) error messages. Please ensure the required permissions are applied,
e.g. with the `sudo chmod 755 --recursive "$DOWNLOAD_DIR"` command.**
The `download_all_data.sh` script will also download the model parameter files.
Once the script has finished, you should have the following directory structure:
```
$DOWNLOAD_DIR/ # Total: ~ 2.62 TB (download: 556 GB)
bfd/ # ~ 1.8 TB (download: 271.6 GB)
# 6 files.
mgnify/ # ~ 120 GB (download: 67 GB)
mgy_clusters_2022_05.fa
params/ # ~ 5.3 GB (download: 5.3 GB)
# 5 CASP14 models,
# 5 pTM models,
# 5 AlphaFold-Multimer models,
# LICENSE,
# = 16 files.
pdb70/ # ~ 56 GB (download: 19.5 GB)
# 9 files.
pdb_mmcif/ # ~ 238 GB (download: 43 GB)
mmcif_files/
# About 199,000 .cif files.
obsolete.dat
pdb_seqres/ # ~ 0.2 GB (download: 0.2 GB)
pdb_seqres.txt
small_bfd/ # ~ 17 GB (download: 9.6 GB)
bfd-first_non_consensus_sequences.fasta
uniref30/ # ~ 206 GB (download: 52.5 GB)
# 7 files.
uniprot/ # ~ 105 GB (download: 53 GB)
uniprot.fasta
uniref90/ # ~ 67 GB (download: 34 GB)
uniref90.fasta
```
`bfd/` is only downloaded if you download the full databases, and `small_bfd/`
is only downloaded if you download the reduced databases.
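As a rough illustration of the layout above, the sketch below reports which expected sub-directories are absent from a download directory. The directory names come from the listing; the grouping into "common" and "multimer-only" follows the "only for AlphaFold-Multimer" notes and is an assumption, as is the helper name `missing_database_dirs`.

```python
from pathlib import Path

# Sub-directory names taken from the directory listing above.
COMMON_DIRS = ["mgnify", "params", "pdb70", "pdb_mmcif", "uniref30", "uniref90"]
# Marked "only for AlphaFold-Multimer" in the database list.
MULTIMER_DIRS = ["pdb_seqres", "uniprot"]


def missing_database_dirs(download_dir, db_preset="full_dbs", multimer=False):
    """Return the expected sub-directories that are absent from download_dir."""
    expected = list(COMMON_DIRS)
    # bfd/ comes with the full databases, small_bfd/ with the reduced ones.
    expected.append("bfd" if db_preset == "full_dbs" else "small_bfd")
    if multimer:
        expected.extend(MULTIMER_DIRS)
    root = Path(download_dir)
    return sorted(name for name in expected if not (root / name).is_dir())
```

An empty result only shows that the directories exist; it does not verify file counts or sizes.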
### Model parameters
While the AlphaFold code is licensed under the Apache 2.0 License, the AlphaFold
parameters and CASP15 prediction data are made available under the terms of the
CC BY 4.0 license. Please see the [Disclaimer](#license-and-disclaimer) below
for more detail.
The AlphaFold parameters are available from
https://storage.googleapis.com/alphafold/alphafold_params_2022-12-06.tar, and
are downloaded as part of the `scripts/download_all_data.sh` script. This script
will download parameters for:
* 5 models which were used during CASP14, and were extensively validated for
structure prediction quality (see Jumper et al. 2021, Suppl. Methods 1.12
for details).
* 5 pTM models, which were fine-tuned to produce pTM (predicted TM-score) and
predicted aligned error (PAE) values alongside their structure predictions
(see Jumper et al. 2021, Suppl. Methods 1.9.7 for details).
* 5 AlphaFold-Multimer models that produce pTM and PAE values alongside their
structure predictions.
### Updating existing installation
If you have a previous version you can either reinstall fully from scratch
(remove everything and run the setup from scratch) or you can do an incremental
update that will be significantly faster but will require a bit more work. Make
sure you follow these steps in the exact order they are listed below:
1. **Update the code.**
* Go to the directory with the cloned AlphaFold repository and run `git
fetch origin main` to get all code updates.
1. **Update the UniProt, UniRef, MGnify and PDB seqres databases.**
* Remove `<DOWNLOAD_DIR>/uniprot`.
* Run `scripts/download_uniprot.sh <DOWNLOAD_DIR>`.
* Remove `<DOWNLOAD_DIR>/uniclust30`.
* Run `scripts/download_uniref30.sh <DOWNLOAD_DIR>`.
* Remove `<DOWNLOAD_DIR>/uniref90`.
* Run `scripts/download_uniref90.sh <DOWNLOAD_DIR>`.
* Remove `<DOWNLOAD_DIR>/mgnify`.
* Run `scripts/download_mgnify.sh <DOWNLOAD_DIR>`.
* Remove `<DOWNLOAD_DIR>/pdb_mmcif`. PDB SeqRes and PDB must be from exactly
the same date. Skipping this step can cause errors when searching for
templates while running AlphaFold-Multimer.
* Run `scripts/download_pdb_mmcif.sh <DOWNLOAD_DIR>`.
* Run `scripts/download_pdb_seqres.sh <DOWNLOAD_DIR>`.
1. **Update the model parameters.**
* Remove the old model parameters in `<DOWNLOAD_DIR>/params`.
* Download new model parameters using
`scripts/download_alphafold_params.sh <DOWNLOAD_DIR>`.
1. **Follow [Running AlphaFold](#running-alphafold).**
#### Using deprecated model weights
To use the deprecated v2.2.0 AlphaFold-Multimer model weights:
1. Change `SOURCE_URL` in `scripts/download_alphafold_params.sh` to
`https://storage.googleapis.com/alphafold/alphafold_params_2022-03-02.tar`,
and download the old parameters.
2. Change the `_v3` to `_v2` in the multimer `MODEL_PRESETS` in `config.py`.
To use the deprecated v2.1.0 AlphaFold-Multimer model weights:
1. Change `SOURCE_URL` in `scripts/download_alphafold_params.sh` to
`https://storage.googleapis.com/alphafold/alphafold_params_2022-01-19.tar`,
and download the old parameters.
2. Remove the `_v3` in the multimer `MODEL_PRESETS` in `config.py`.
## Running AlphaFold
**The simplest way to run AlphaFold is using the provided Docker script.** This
was tested on Google Cloud with a machine using the `nvidia-gpu-cloud-image`
with 12 vCPUs, 85 GB of RAM, a 100 GB boot disk, the databases on an additional
3 TB disk, and an A100 GPU. For your first run, please follow the instructions
from [Installation and running your first prediction](#installation-and-running-your-first-prediction)
section.
1. By default, AlphaFold will attempt to use all visible GPU devices. To use a
subset, specify a comma-separated list of GPU UUID(s) or index(es) using the
`--gpu_devices` flag. See
[GPU enumeration](https://docs.nvidia.com/datacenter/cloud-native/container-toolkit/user-guide.html#gpu-enumeration)
for more details.
1. You can control which AlphaFold model to run by adding the `--model_preset=`
flag. We provide the following models:
* **monomer**: This is the original model used at CASP14 with no
ensembling.
* **monomer\_casp14**: This is the original model used at CASP14 with
`num_ensemble=8`, matching our CASP14 configuration. This is largely
provided for reproducibility as it is 8x more computationally expensive
for limited accuracy gain (+0.1 average GDT gain on CASP14 domains).
* **monomer\_ptm**: This is the original CASP14 model fine tuned with the
pTM head, providing a pairwise confidence measure. It is slightly less
accurate than the normal monomer model.
* **multimer**: This is the [AlphaFold-Multimer](#citing-this-work) model.
To use this model, provide a multi-sequence FASTA file. In addition, the
UniProt database should have been downloaded.
1. You can control MSA speed/quality tradeoff by adding
`--db_preset=reduced_dbs` or `--db_preset=full_dbs` to the run command. We
provide the following presets:
* **reduced\_dbs**: This preset is optimized for speed and lower hardware
requirements. It runs with a reduced version of the BFD database. It
requires 8 CPU cores (vCPUs), 8 GB of RAM, and 600 GB of disk space.
* **full\_dbs**: This runs with all genetic databases used at CASP14.
Running the command above with the `monomer` model preset and the
`reduced_dbs` data preset would look like this:
```bash
python3 docker/run_docker.py \
--fasta_paths=T1050.fasta \
--max_template_date=2020-05-14 \
--model_preset=monomer \
--db_preset=reduced_dbs \
--data_dir=$DOWNLOAD_DIR \
--output_dir=/home/user/absolute_path_to_the_output_dir
```
1. After generating the predicted model, AlphaFold runs a relaxation
step to improve local geometry. By default, only the best model (by
pLDDT) is relaxed (`--models_to_relax=best`). Alternatively, all of the models
(`--models_to_relax=all`) or none of the models (`--models_to_relax=none`)
can be relaxed.
1. The relaxation step can be run on GPU (faster, but could be less stable) or
CPU (slow, but stable). This can be controlled with `--enable_gpu_relax=true`
(default) or `--enable_gpu_relax=false`.
1. AlphaFold can re-use MSAs (multiple sequence alignments) for the same
sequence via `--use_precomputed_msas=true` option; this can be useful for
trying different AlphaFold parameters. This option assumes that the
directory structure generated by the first AlphaFold run in the output
directory exists and that the protein sequence is the same.
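When many runs differ only in a few flags, it can help to assemble the `run_docker.py` command programmatically. The sketch below uses only flags named in this section; the helper `build_run_command` and its default values are hypothetical conveniences, not part of AlphaFold.

```python
import shlex

# Model presets listed in this section.
MODEL_PRESETS = ("monomer", "monomer_casp14", "monomer_ptm", "multimer")


def build_run_command(fasta_paths, data_dir, output_dir, max_template_date,
                      model_preset="monomer", db_preset="full_dbs",
                      **extra_flags):
    """Assemble an argument list for docker/run_docker.py."""
    if model_preset not in MODEL_PRESETS:
        raise ValueError(f"unknown model preset: {model_preset}")
    flags = {
        "fasta_paths": ",".join(fasta_paths),
        "max_template_date": max_template_date,
        "model_preset": model_preset,
        "db_preset": db_preset,
        "data_dir": data_dir,
        "output_dir": output_dir,
    }
    flags.update(extra_flags)  # e.g. models_to_relax="all"
    args = ["python3", "docker/run_docker.py"]
    for name, value in flags.items():
        if isinstance(value, bool):
            # Boolean flags are written as true/false in the examples above.
            value = "true" if value else "false"
        args.append(f"--{name}={value}")
    return args


command = build_run_command(
    ["T1050.fasta"], "/data/af_download", "/home/user/out", "2020-05-14",
    db_preset="reduced_dbs", enable_gpu_relax=False)
print(shlex.join(command))
```

The returned list can be passed directly to `subprocess.run`; the paths shown are placeholders.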
### Running AlphaFold-Multimer
All steps are the same as when running the monomer system, but you will have to:
* provide an input fasta with multiple sequences, and
* set `--model_preset=multimer`.
An example that folds a protein complex `multimer.fasta`:
```bash
python3 docker/run_docker.py \
--fasta_paths=multimer.fasta \
--max_template_date=2020-05-14 \
--model_preset=multimer \
--data_dir=$DOWNLOAD_DIR \
--output_dir=/home/user/absolute_path_to_the_output_dir
```
By default the multimer system will run 5 seeds per model (25 total predictions).
For a small drop in accuracy, you may wish to run a single seed per model. This
can be done via the `--num_multimer_predictions_per_model` flag, e.g.
`--num_multimer_predictions_per_model=1`.
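The arithmetic behind the "25 total predictions" figure is simply the number of multimer models times the seeds per model. A minimal sketch, with a hypothetical helper name:

```python
NUM_MULTIMER_MODELS = 5  # the parameter set ships 5 AlphaFold-Multimer models


def total_multimer_predictions(predictions_per_model=5):
    """Number of structures produced for one multimer target."""
    if predictions_per_model < 1:
        raise ValueError("predictions_per_model must be at least 1")
    return NUM_MULTIMER_MODELS * predictions_per_model


print(total_multimer_predictions())   # default: 5 seeds per model
print(total_multimer_predictions(1))  # single seed per model
```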
### AlphaFold prediction speed
The table below reports prediction runtimes for proteins of various lengths. We
only measure unrelaxed structure prediction with three recycles while
excluding runtimes from MSA and template search. When running
`docker/run_docker.py` with `--benchmark=true`, this runtime is stored in
`timings.json`. All runtimes are from a single A100 NVIDIA GPU. Prediction
speed on A100 for smaller structures can be improved by increasing
`global_config.subbatch_size` in `alphafold/model/config.py`.
No. residues | Prediction time (s)
-----------: | ------------------:
100 | 4.9
200 | 7.7
300 | 13
400 | 18
500 | 29
600 | 36
700 | 53
800 | 60
900 | 91
1,000 | 96
1,100 | 140
1,500 | 280
2,000 | 450
2,500 | 969
3,000 | 1,240
3,500 | 2,465
4,000 | 5,660
4,500 | 12,475
5,000 | 18,824
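For rough planning between the tabulated lengths, one option is to interpolate linearly between neighbouring rows. This is only a sketch: linear interpolation is an assumption (runtimes clearly grow faster than linearly), the numbers apply to a single A100, and `estimate_seconds` is a hypothetical helper.

```python
from bisect import bisect_left

# (residues, seconds) pairs copied from the table above.
RUNTIMES = [
    (100, 4.9), (200, 7.7), (300, 13), (400, 18), (500, 29), (600, 36),
    (700, 53), (800, 60), (900, 91), (1000, 96), (1100, 140), (1500, 280),
    (2000, 450), (2500, 969), (3000, 1240), (3500, 2465), (4000, 5660),
    (4500, 12475), (5000, 18824),
]


def estimate_seconds(num_residues):
    """Linearly interpolate the unrelaxed prediction time from the table."""
    lengths = [n for n, _ in RUNTIMES]
    if not lengths[0] <= num_residues <= lengths[-1]:
        raise ValueError("length is outside the range covered by the table")
    i = bisect_left(lengths, num_residues)  # first row with length >= input
    n1, t1 = RUNTIMES[i]
    if n1 == num_residues:
        return float(t1)
    n0, t0 = RUNTIMES[i - 1]
    fraction = (num_residues - n0) / (n1 - n0)
    return t0 + fraction * (t1 - t0)
```

The estimate excludes MSA and template search, as the table does.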
### Examples
Below are examples on how to use AlphaFold in different scenarios.
#### Folding a monomer
Say we have a monomer with the sequence `<SEQUENCE>`. The input fasta should be:
```fasta
>sequence_name
<SEQUENCE>
```
Then run the following command:
```bash
python3 docker/run_docker.py \
--fasta_paths=monomer.fasta \
--max_template_date=2021-11-01 \
--model_preset=monomer \
--data_dir=$DOWNLOAD_DIR \
--output_dir=/home/user/absolute_path_to_the_output_dir
```
#### Folding a homomer
Say we have a homomer with 3 copies of the same sequence `<SEQUENCE>`. The input
fasta should be:
```fasta
>sequence_1
<SEQUENCE>
>sequence_2
<SEQUENCE>
>sequence_3
<SEQUENCE>
```
Then run the following command:
```bash
python3 docker/run_docker.py \
--fasta_paths=homomer.fasta \
--max_template_date=2021-11-01 \
--model_preset=multimer \
--data_dir=$DOWNLOAD_DIR \
--output_dir=/home/user/absolute_path_to_the_output_dir
```
#### Folding a heteromer
Say we have an A2B3 heteromer, i.e. with 2 copies of `<SEQUENCE A>` and 3 copies
of `<SEQUENCE B>`. The input fasta should be:
```fasta
>sequence_1
<SEQUENCE A>
>sequence_2
<SEQUENCE A>
>sequence_3
<SEQUENCE B>
>sequence_4
<SEQUENCE B>
>sequence_5
<SEQUENCE B>
```
Then run the following command:
```bash
python3 docker/run_docker.py \
--fasta_paths=heteromer.fasta \
--max_template_date=2021-11-01 \
--model_preset=multimer \
--data_dir=$DOWNLOAD_DIR \
--output_dir=/home/user/absolute_path_to_the_output_dir
```
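The homomer and heteromer inputs above follow one pattern: one FASTA record per chain copy. The sketch below generates such a file from `(sequence, copies)` pairs; the helper name and the example sequences are hypothetical placeholders.

```python
def make_multimer_fasta(chains):
    """Build FASTA text with one record per chain copy.

    chains: list of (sequence, copies) pairs, e.g. an A2B3 heteromer is
    [(sequence_a, 2), (sequence_b, 3)].
    """
    records = []
    index = 1
    for sequence, copies in chains:
        for _ in range(copies):
            records.append(f">sequence_{index}\n{sequence}")
            index += 1
    return "\n".join(records) + "\n"


# Placeholder sequences standing in for <SEQUENCE A> and <SEQUENCE B>.
fasta_text = make_multimer_fasta([("MKTAYIAK", 2), ("GSHMLEDP", 3)])
print(fasta_text)
```

Writing `fasta_text` to `heteromer.fasta` yields an input of the shape shown above.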
#### Folding multiple monomers one after another
Say we have two monomers, `monomer1.fasta` and `monomer2.fasta`.
We can fold both sequentially by using the following command:
```bash
python3 docker/run_docker.py \
--fasta_paths=monomer1.fasta,monomer2.fasta \
--max_template_date=2021-11-01 \
--model_preset=monomer \
--data_dir=$DOWNLOAD_DIR \
--output_dir=/home/user/absolute_path_to_the_output_dir
```
#### Folding multiple multimers one after another
Say we have two multimers, `multimer1.fasta` and `multimer2.fasta`.
We can fold both sequentially by using the following command:
```bash
python3 docker/run_docker.py \
--fasta_paths=multimer1.fasta,multimer2.fasta \
--max_template_date=2021-11-01 \
--model_preset=multimer \
--data_dir=$DOWNLOAD_DIR \
--output_dir=/home/user/absolute_path_to_the_output_dir
```
### AlphaFold output
The outputs will be saved in a subdirectory of the directory provided via the
`--output_dir` flag of `run_docker.py` (defaults to `/tmp/alphafold/`). The
outputs include the computed MSAs, unrelaxed structures, relaxed structures,
ranked structures, raw model outputs, prediction metadata, and section timings.
The `--output_dir` directory will have the following structure:
```
<target_name>/
features.pkl
ranked_{0,1,2,3,4}.pdb
ranking_debug.json
relax_metrics.json
relaxed_model_{1,2,3,4,5}.pdb
result_model_{1,2,3,4,5}.pkl
timings.json
unrelaxed_model_{1,2,3,4,5}.pdb
msas/
bfd_uniref_hits.a3m
mgnify_hits.sto
uniref90_hits.sto
```
The contents of each output file are as follows:
* `features.pkl` – A `pickle` file containing the input feature NumPy arrays
used by the models to produce the structures.
* `unrelaxed_model_*.pdb` – A PDB format text file containing the predicted
structure, exactly as output by the model.
* `relaxed_model_*.pdb` – A PDB format text file containing the predicted
structure, after performing an Amber relaxation procedure on the unrelaxed
structure prediction (see Jumper et al. 2021, Suppl. Methods 1.8.6 for
details).
* `ranked_*.pdb` – A PDB format text file containing the predicted structures,
after reordering by model confidence. Here `ranked_i.pdb` should contain
the prediction with the (`i + 1`)-th highest confidence (so that
`ranked_0.pdb` has the highest confidence). To rank model confidence, we use
predicted LDDT (pLDDT) scores (see Jumper et al. 2021, Suppl. Methods 1.9.6
for details). If `--models_to_relax=all` then all ranked structures are
relaxed. If `--models_to_relax=best` then only `ranked_0.pdb` is relaxed
(the rest are unrelaxed). If `--models_to_relax=none`, then the ranked
structures are all unrelaxed.
* `ranking_debug.json` – A JSON format text file containing the pLDDT values
used to perform the model ranking, and a mapping back to the original model
names.
* `relax_metrics.json` – A JSON format text file containing relax metrics, for
instance remaining violations.
* `timings.json` – A JSON format text file containing the times taken to run
each section of the AlphaFold pipeline.
* `msas/` - A directory containing the files describing the various genetic
tool hits that were used to construct the input MSA.
* `result_model_*.pkl` – A `pickle` file containing a nested dictionary of the
various NumPy arrays directly produced by the model. In addition to the
output of the structure module, this includes auxiliary outputs such as:
* Distograms (`distogram/logits` contains a NumPy array of shape [N_res,
N_res, N_bins] and `distogram/bin_edges` contains the definition of the
bins).
* Per-residue pLDDT scores (`plddt` contains a NumPy array of shape
[N_res] with the range of possible values from `0` to `100`, where `100`
means most confident). This can serve to identify sequence regions
predicted with high confidence or as an overall per-target confidence
score when averaged across residues.
* Present only if using pTM models: predicted TM-score (`ptm` field
contains a scalar). As a predictor of a global superposition metric,
this score is designed to also assess whether the model is confident in
the overall domain packing.
* Present only if using pTM models: predicted pairwise aligned errors
(`predicted_aligned_error` contains a NumPy array of shape [N_res,
N_res] with the range of possible values from `0` to
`max_predicted_aligned_error`, where `0` means most confident). This can
serve for a visualisation of domain packing confidence within the
structure.
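To illustrate the per-target confidence score described above, the sketch below loads a `result_model_*.pkl` file and averages the `plddt` field across residues. It assumes the field is a flat sequence of numbers, as stated above; NumPy must be importable to unpickle real AlphaFold outputs, and the helper name `mean_plddt` is hypothetical. Only unpickle files you produced yourself, since `pickle` can execute arbitrary code.

```python
import pickle


def mean_plddt(result_pkl_path):
    """Average the per-residue pLDDT stored under the `plddt` key."""
    with open(result_pkl_path, "rb") as handle:
        result = pickle.load(handle)
    plddt = result["plddt"]  # shape [N_res], values from 0 to 100
    return float(sum(plddt)) / len(plddt)


def rank_by_mean_plddt(result_pkl_paths):
    """Sort result files from most to least confident."""
    return sorted(result_pkl_paths, key=mean_plddt, reverse=True)
```

Applied to the five `result_model_*.pkl` files of one target, `rank_by_mean_plddt` gives an ordering that can be compared with `ranking_debug.json`.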
The pLDDT confidence measure is stored in the B-factor field of the output PDB
files (although unlike a B-factor, higher pLDDT is better, so care must be taken
when using for tasks such as molecular replacement).
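Because pLDDT sits in the B-factor field, it can be read back with plain fixed-column parsing of the PDB `ATOM` records (B-factor in columns 61 to 66). The sketch below takes one value per residue from the CA atom; that a residue's atoms all carry the same value is an assumption, insertion codes are ignored, and `plddt_from_pdb_lines` is a hypothetical helper.

```python
def plddt_from_pdb_lines(lines):
    """Map (chain_id, residue_number) to the pLDDT in the B-factor column."""
    scores = {}
    for line in lines:
        if not line.startswith("ATOM") or len(line) < 66:
            continue
        if line[12:16].strip() != "CA":  # atom name, columns 13-16
            continue
        chain_id = line[21]                # column 22
        residue_number = int(line[22:26])  # columns 23-26
        scores[(chain_id, residue_number)] = float(line[60:66])
    return scores
```

Passing an open file handle for `ranked_0.pdb` as `lines` returns a dictionary that can be filtered for low-confidence regions.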
This code has been tested to match mean top-1 accuracy on a CASP14 test set with
pLDDT ranking over 5 model predictions (some CASP targets were run with earlier
versions of AlphaFold and some had manual interventions; see our forthcoming
publication for details). Some targets such as T1064 may also have high
individual run variance over random seeds.
## Inferencing many proteins
The provided inference script is optimized for predicting the structure of a
single protein, and it will compile the neural network to be specialized to
exactly the size of the sequence, MSA, and templates. For large proteins, the
compile time is a negligible fraction of the runtime, but it may become more
significant for small proteins or if the multi-sequence alignments are already
precomputed. In the bulk inference case, it may make sense to use our
`make_fixed_size` function to pad the inputs to a uniform size, thereby reducing
the number of compilations required.
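The idea of padding to a uniform size can be sketched with simple bucketing: round each sequence length up to a multiple of a bucket size so that many inputs share one compiled shape. This is not the `make_fixed_size` function itself, it covers only the sequence dimension (MSA and template sizes would need the same treatment), and the bucket size of 256 is an arbitrary assumption.

```python
def bucket_length(num_residues, bucket_size=256):
    """Round num_residues up to the next multiple of bucket_size."""
    if num_residues < 1 or bucket_size < 1:
        raise ValueError("lengths and bucket sizes must be positive")
    return -(-num_residues // bucket_size) * bucket_size  # ceiling division


def count_compilations(lengths, bucket_size=256):
    """Number of distinct padded sizes, i.e. compilations, for a batch."""
    return len({bucket_length(n, bucket_size) for n in lengths})


lengths = [100, 120, 200, 300, 310]
print(count_compilations(lengths))                 # with padding to buckets
print(count_compilations(lengths, bucket_size=1))  # one per distinct length
```

Larger buckets mean fewer compilations but more wasted computation on padding, so the bucket size is a tradeoff to tune per workload.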
We do not provide a bulk inference script, but it should be straightforward to
develop on top of the `RunModel.predict` method with a parallel system for
precomputing multi-sequence alignments. Alternatively, this script can be run
repeatedly with only moderate overhead.
## Note on CASP14 reproducibility
AlphaFold's output for a small number of proteins has high inter-run variance,
and may be affected by changes in the input data. The CASP14 target T1064 is a
notable example; the large number of SARS-CoV-2-related sequences recently
deposited changes its MSA significantly. This variability is somewhat mitigated
by the model selection process: running 5 models and taking the most confident.
To reproduce the results of our CASP14 system as closely as possible you must
use the same database versions we used in CASP. These may not match the default
versions downloaded by our scripts.
For genetics:
* UniRef90:
[v2020_01](https://ftp.uniprot.org/pub/databases/uniprot/previous_releases/release-2020_01/uniref/)
* MGnify:
[v2018_12](http://ftp.ebi.ac.uk/pub/databases/metagenomics/peptide_database/2018_12/)
* Uniclust30: [v2018_08](http://wwwuser.gwdg.de/~compbiol/uniclust/2018_08/)
* BFD: [only version available](https://bfd.mmseqs.com/)
For templates:
* PDB: (downloaded 2020-05-14)
* PDB70:
[2020-05-13](http://wwwuser.gwdg.de/~compbiol/data/hhsuite/databases/hhsuite_dbs/old-releases/pdb70_from_mmcif_200513.tar.gz)
An alternative for templates is to use the latest PDB and PDB70, but pass the
flag `--max_template_date=2020-05-14`, which restricts templates only to
structures that were available at the start of CASP14.
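The effect of a template date cutoff can be illustrated with a small filter over release dates. This is a sketch of the idea only, not AlphaFold's template search; whether the cutoff date itself is included is an assumption here, and the PDB IDs and dates are made up for illustration.

```python
from datetime import date


def allowed_templates(release_dates, max_template_date):
    """Return IDs released on or before the cutoff (inclusive by assumption).

    release_dates: mapping of template ID to ISO date string (YYYY-MM-DD).
    """
    cutoff = date.fromisoformat(max_template_date)
    return sorted(
        template_id
        for template_id, released in release_dates.items()
        if date.fromisoformat(released) <= cutoff
    )


# Hypothetical template IDs and release dates.
releases = {"1abc": "2019-03-01", "2xyz": "2020-05-14", "3new": "2021-08-20"}
print(allowed_templates(releases, "2020-05-14"))
```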
## Citing this work
If you use the code or data in this package, please cite:
```bibtex
@Article{AlphaFold2021,
author = {Jumper, John and Evans, Richard and Pritzel, Alexander and Green, Tim and Figurnov, Michael and Ronneberger, Olaf and Tunyasuvunakool, Kathryn and Bates, Russ and {\v{Z}}{\'\i}dek, Augustin and Potapenko, Anna and Bridgland, Alex and Meyer, Clemens and Kohl, Simon A A and Ballard, Andrew J and Cowie, Andrew and Romera-Paredes, Bernardino and Nikolov, Stanislav and Jain, Rishub and Adler, Jonas and Back, Trevor and Petersen, Stig and Reiman, David and Clancy, Ellen and Zielinski, Michal and Steinegger, Martin and Pacholska, Michalina and Berghammer, Tamas and Bodenstein, Sebastian and Silver, David and Vinyals, Oriol and Senior, Andrew W and Kavukcuoglu, Koray and Kohli, Pushmeet and Hassabis, Demis},
journal = {Nature},
title = {Highly accurate protein structure prediction with {AlphaFold}},
year = {2021},
volume = {596},
number = {7873},
pages = {583--589},
doi = {10.1038/s41586-021-03819-2}
}
```
In addition, if you use the AlphaFold-Multimer mode, please cite:
```bibtex
@article {AlphaFold-Multimer2021,
author = {Evans, Richard and O{\textquoteright}Neill, Michael and Pritzel, Alexander and Antropova, Natasha and Senior, Andrew and Green, Tim and {\v{Z}}{\'\i}dek, Augustin and Bates, Russ and Blackwell, Sam and Yim, Jason and Ronneberger, Olaf and Bodenstein, Sebastian and Zielinski, Michal and Bridgland, Alex and Potapenko, Anna and Cowie, Andrew and Tunyasuvunakool, Kathryn and Jain, Rishub and Clancy, Ellen and Kohli, Pushmeet and Jumper, John and Hassabis, Demis},
journal = {bioRxiv},
title = {Protein complex prediction with AlphaFold-Multimer},
year = {2021},
elocation-id = {2021.10.04.463034},
doi = {10.1101/2021.10.04.463034},
URL = {https://www.biorxiv.org/content/early/2021/10/04/2021.10.04.463034},
eprint = {https://www.biorxiv.org/content/early/2021/10/04/2021.10.04.463034.full.pdf},
}
```
## Community contributions
Colab notebooks provided by the community (please note that these notebooks may
vary from our full AlphaFold system and we did not validate their accuracy):
* The
[ColabFold AlphaFold2 notebook](https://colab.research.google.com/github/sokrypton/ColabFold/blob/main/AlphaFold2.ipynb)
by Martin Steinegger, Sergey Ovchinnikov and Milot Mirdita, which uses an
API hosted at the Södinglab based on the MMseqs2 server
[(Mirdita et al. 2019, Bioinformatics)](https://academic.oup.com/bioinformatics/article/35/16/2856/5280135)
for the multiple sequence alignment creation.
## Acknowledgements
AlphaFold communicates with and/or references the following separate libraries
and packages:
* [Abseil](https://github.com/abseil/abseil-py)
* [Biopython](https://biopython.org)
* [Chex](https://github.com/deepmind/chex)
* [Colab](https://research.google.com/colaboratory/)
* [Docker](https://www.docker.com)
* [HH Suite](https://github.com/soedinglab/hh-suite)
* [HMMER Suite](http://eddylab.org/software/hmmer)
* [Haiku](https://github.com/deepmind/dm-haiku)
* [Immutabledict](https://github.com/corenting/immutabledict)
* [JAX](https://github.com/google/jax/)
* [Kalign](https://msa.sbc.su.se/cgi-bin/msa.cgi)
* [matplotlib](https://matplotlib.org/)
* [ML Collections](https://github.com/google/ml_collections)
* [NumPy](https://numpy.org)
* [OpenMM](https://github.com/openmm/openmm)
* [OpenStructure](https://openstructure.org)
* [pandas](https://pandas.pydata.org/)
* [py3Dmol](https://github.com/avirshup/py3dmol)
* [SciPy](https://scipy.org)
* [Sonnet](https://github.com/deepmind/sonnet)
* [TensorFlow](https://github.com/tensorflow/tensorflow)
* [Tree](https://github.com/deepmind/tree)
* [tqdm](https://github.com/tqdm/tqdm)
We thank all their contributors and maintainers!
## Get in Touch
If you have any questions not covered in this overview, please contact the
AlphaFold team at [alphafold@deepmind.com](mailto:alphafold@deepmind.com).
We would love to hear your feedback and understand how AlphaFold has been useful
in your research. Share your stories with us at
[alphafold@deepmind.com](mailto:alphafold@deepmind.com).
## License and Disclaimer
This is not an officially supported Google product.
Copyright 2022 DeepMind Technologies Limited.
### AlphaFold Code License
Licensed under the Apache License, Version 2.0 (the "License"); you may not use
this file except in compliance with the License. You may obtain a copy of the
License at https://www.apache.org/licenses/LICENSE-2.0.
Unless required by applicable law or agreed to in writing, software distributed
under the License is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR
CONDITIONS OF ANY KIND, either express or implied. See the License for the
specific language governing permissions and limitations under the License.
### Model Parameters License
The AlphaFold parameters are made available under the terms of the Creative
Commons Attribution 4.0 International (CC BY 4.0) license. You can find details
at: https://creativecommons.org/licenses/by/4.0/legalcode
### Third-party software
Use of the third-party software, libraries or code referred to in the
[Acknowledgements](#acknowledgements) section above may be governed by separate
terms and conditions or license provisions. Your use of the third-party
software, libraries or code is subject to any such terms and you should check
that you can comply with any applicable restrictions or terms and conditions
before use.
### Mirrored Databases
The following databases have been mirrored by DeepMind, and are available with
reference to the following:
* [BFD](https://bfd.mmseqs.com/) (unmodified), by Steinegger M. and Söding J.,
available under a
[Creative Commons Attribution-ShareAlike 4.0 International License](http://creativecommons.org/licenses/by-sa/4.0/).
* [BFD](https://bfd.mmseqs.com/) (modified), by Steinegger M. and Söding J.,
modified by DeepMind, available under a
[Creative Commons Attribution-ShareAlike 4.0 International License](http://creativecommons.org/licenses/by-sa/4.0/).
See the Methods section of the
[AlphaFold proteome paper](https://www.nature.com/articles/s41586-021-03828-1)
for details.
* [Uniref30: v2021_03](http://wwwuser.gwdg.de/~compbiol/uniclust/2021_03/)
(unmodified), by Mirdita M. et al., available under a
[Creative Commons Attribution-ShareAlike 4.0 International License](http://creativecommons.org/licenses/by-sa/4.0/).
* [MGnify: v2022_05](http://ftp.ebi.ac.uk/pub/databases/metagenomics/peptide_database/2022_05/README.txt)
(unmodified), by Mitchell AL et al., available free of all copyright
restrictions and made fully and freely available for both non-commercial and
commercial use under
[CC0 1.0 Universal (CC0 1.0) Public Domain Dedication](https://creativecommons.org/publicdomain/zero/1.0/).
...@@ -14,7 +14,9 @@ ...@@ -14,7 +14,9 @@
"""Functions for processing confidence metrics.""" """Functions for processing confidence metrics."""
import json
from typing import Dict, Optional, Tuple from typing import Dict, Optional, Tuple
import numpy as np import numpy as np
import scipy.special import scipy.special
...@@ -36,6 +38,43 @@ def compute_plddt(logits: np.ndarray) -> np.ndarray: ...@@ -36,6 +38,43 @@ def compute_plddt(logits: np.ndarray) -> np.ndarray:
return predicted_lddt_ca * 100 return predicted_lddt_ca * 100
def _confidence_category(score: float) -> str:
  """Categorizes pLDDT into: disordered (D), low (L), medium (M), high (H)."""
  if 0 <= score < 50:
    return 'D'
  elif 50 <= score < 70:
    return 'L'
  elif 70 <= score < 90:
    return 'M'
  elif 90 <= score <= 100:
    return 'H'
  else:
    raise ValueError(f'Invalid pLDDT score {score}')
def confidence_json(plddt: np.ndarray) -> str:
  """Returns JSON with confidence score and category for every residue.

  Args:
    plddt: Per-residue confidence metric data.

  Returns:
    String with a formatted JSON.

  Raises:
    ValueError: If `plddt` has a rank different than 1.
  """
  if plddt.ndim != 1:
    raise ValueError(f'The plddt array must be rank 1, got: {plddt.shape}.')

  confidence = {
      'residueNumber': list(range(1, len(plddt) + 1)),
      'confidenceScore': [round(float(s), 2) for s in plddt],
      'confidenceCategory': [_confidence_category(s) for s in plddt],
  }
  return json.dumps(confidence, indent=None, separators=(',', ':'))
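
The banding and the compact JSON layout can be tried without NumPy. The following is a minimal standard-library sketch, not the module's code: the names `category` and `confidence_json_sketch` are hypothetical, and the `bisect` lookup is swapped in for the if/elif chain while keeping the same band edges (50, 70, 90).

```python
import bisect
import json

# Lower bounds of the L, M and H bands; scores below 50 fall into D.
_BAND_STARTS = (50, 70, 90)
_BAND_LABELS = ('D', 'L', 'M', 'H')


def category(score):
  """Maps a pLDDT score in [0, 100] to D, L, M or H."""
  if not 0 <= score <= 100:
    raise ValueError(f'Invalid pLDDT score {score}')
  return _BAND_LABELS[bisect.bisect_right(_BAND_STARTS, score)]


def confidence_json_sketch(scores):
  """Builds the compact per-residue confidence JSON from a list of scores."""
  payload = {
      'residueNumber': list(range(1, len(scores) + 1)),
      'confidenceScore': [round(float(s), 2) for s in scores],
      'confidenceCategory': [category(s) for s in scores],
  }
  return json.dumps(payload, separators=(',', ':'))


result = confidence_json_sketch([42, 42.42, 71.5, 95])
```

Each band includes its lower edge, so a score of exactly 50 is `L` and exactly 90 is `H`.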
def _calculate_bin_centers(breaks: np.ndarray): def _calculate_bin_centers(breaks: np.ndarray):
"""Gets the bin centers from the bin edges. """Gets the bin centers from the bin edges.
...@@ -108,6 +147,32 @@ def compute_predicted_aligned_error( ...@@ -108,6 +147,32 @@ def compute_predicted_aligned_error(
} }
def pae_json(pae: np.ndarray, max_pae: float) -> str:
  """Returns the PAE in the same format as is used in the AFDB.

  Note that the values are presented as floats to 1 decimal place, whereas AFDB
  returns integer values.

  Args:
    pae: The n_res x n_res PAE array.
    max_pae: The maximum possible PAE value.

  Returns:
    PAE output format as a JSON string.
  """
  # Check the PAE array is the correct shape.
  if pae.ndim != 2 or pae.shape[0] != pae.shape[1]:
    raise ValueError(f'PAE must be a square matrix, got {pae.shape}')

  # Round the predicted aligned errors to 1 decimal place.
  rounded_errors = np.round(pae.astype(np.float64), decimals=1)
  formatted_output = [{
      'predicted_aligned_error': rounded_errors.tolist(),
      'max_predicted_aligned_error': max_pae,
  }]
  return json.dumps(formatted_output, indent=None, separators=(',', ':'))
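
To make the output layout concrete, here is a small standard-library sketch that writes the same structure from nested lists and reads it back. `pae_json_sketch` is a hypothetical name, and Python's built-in `round` stands in for `np.round`; the two may disagree on exact ties, so treat this as an illustration of the format rather than a drop-in replacement.

```python
import json


def pae_json_sketch(pae_rows, max_pae):
  """Serializes a square PAE matrix (list of lists) to compact JSON."""
  n_res = len(pae_rows)
  if any(len(row) != n_res for row in pae_rows):
    raise ValueError('PAE must be a square matrix.')
  # Round every entry to 1 decimal place, as the function above does.
  rounded = [[round(float(value), 1) for value in row] for row in pae_rows]
  payload = [{
      'predicted_aligned_error': rounded,
      'max_predicted_aligned_error': max_pae,
  }]
  return json.dumps(payload, separators=(',', ':'))


text = pae_json_sketch([[0.01, 13.12345], [20.0987, 0.0]], 31.75)
# The top level is a one-element list, so index it before reading fields.
record = json.loads(text)[0]
```

Consumers need to unwrap the outer list first: `record['predicted_aligned_error'][i][j]` is the rounded error at row `i`, column `j`.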
def predicted_tm_score( def predicted_tm_score(
logits: np.ndarray, logits: np.ndarray,
breaks: np.ndarray, breaks: np.ndarray,
......
# Copyright 2023 DeepMind Technologies Limited
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
"""Test confidence metrics."""
from absl.testing import absltest
from alphafold.common import confidence
import numpy as np
class ConfidenceTest(absltest.TestCase):

  def test_pae_json(self):
    pae = np.array([[0.01, 13.12345], [20.0987, 0.0]])
    pae_json = confidence.pae_json(pae=pae, max_pae=31.75)
    self.assertEqual(
        pae_json, '[{"predicted_aligned_error":[[0.0,13.1],[20.1,0.0]],'
        '"max_predicted_aligned_error":31.75}]')

  def test_confidence_json(self):
    plddt = np.array([42, 42.42])
    confidence_json = confidence.confidence_json(plddt=plddt)
    self.assertEqual(
        confidence_json,
        ('{"residueNumber":[1,2],'
         '"confidenceScore":[42.0,42.42],'
         '"confidenceCategory":["D","D"]}'),
    )


if __name__ == '__main__':
  absltest.main()
# Copyright 2021 DeepMind Technologies Limited
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
"""mmCIF metadata."""
from typing import Mapping, Sequence
from alphafold import version
import numpy as np
_DISCLAIMER = """ALPHAFOLD DATA, COPYRIGHT (2021) DEEPMIND TECHNOLOGIES LIMITED.
THE INFORMATION PROVIDED IS THEORETICAL MODELLING ONLY AND CAUTION SHOULD BE
EXERCISED IN ITS USE. IT IS PROVIDED "AS-IS" WITHOUT ANY WARRANTY OF ANY KIND,
WHETHER EXPRESSED OR IMPLIED. NO WARRANTY IS GIVEN THAT USE OF THE INFORMATION
SHALL NOT INFRINGE THE RIGHTS OF ANY THIRD PARTY. DISCLAIMER: THE INFORMATION IS
NOT INTENDED TO BE A SUBSTITUTE FOR PROFESSIONAL MEDICAL ADVICE, DIAGNOSIS, OR
TREATMENT, AND DOES NOT CONSTITUTE MEDICAL OR OTHER PROFESSIONAL ADVICE. IT IS
AVAILABLE FOR ACADEMIC AND COMMERCIAL PURPOSES, UNDER CC-BY 4.0 LICENCE."""
# Authors of the AlphaFold methods paper (published in Nature) that we
# reference in the mmCIF.
_MMCIF_PAPER_AUTHORS = (
'Jumper, John',
'Evans, Richard',
'Pritzel, Alexander',
'Green, Tim',
'Figurnov, Michael',
'Ronneberger, Olaf',
'Tunyasuvunakool, Kathryn',
'Bates, Russ',
'Zidek, Augustin',
'Potapenko, Anna',
'Bridgland, Alex',
'Meyer, Clemens',
'Kohl, Simon A. A.',
'Ballard, Andrew J.',
'Cowie, Andrew',
'Romera-Paredes, Bernardino',
'Nikolov, Stanislav',
'Jain, Rishub',
'Adler, Jonas',
'Back, Trevor',
'Petersen, Stig',
'Reiman, David',
'Clancy, Ellen',
'Zielinski, Michal',
'Steinegger, Martin',
'Pacholska, Michalina',
'Berghammer, Tamas',
'Bodenstein, Sebastian',
'Silver, David',
'Vinyals, Oriol',
'Senior, Andrew W.',
'Kavukcuoglu, Koray',
'Kohli, Pushmeet',
'Hassabis, Demis',
)
# Authors of the mmCIF - we set them to be equal to the authors of the paper.
_MMCIF_AUTHORS = _MMCIF_PAPER_AUTHORS
def add_metadata_to_mmcif(
    old_cif: Mapping[str, Sequence[str]], model_type: str
) -> Mapping[str, Sequence[str]]:
  """Builds the AlphaFold metadata fields for the given mmCIF dict."""
  cif = {}

  # ModelCIF dictionary conformance.
  cif['_audit_conform.dict_name'] = ['mmcif_ma.dic']
  cif['_audit_conform.dict_version'] = ['1.3.9']
  cif['_audit_conform.dict_location'] = [
      'https://raw.githubusercontent.com/ihmwg/ModelCIF/master/dist/'
      'mmcif_ma.dic'
  ]

  # License and disclaimer.
  cif['_pdbx_data_usage.id'] = ['1', '2']
  cif['_pdbx_data_usage.type'] = ['license', 'disclaimer']
  cif['_pdbx_data_usage.details'] = [
      'Data in this file is available under a CC-BY-4.0 license.',
      _DISCLAIMER,
  ]
  cif['_pdbx_data_usage.url'] = [
      'https://creativecommons.org/licenses/by/4.0/',
      '?',
  ]
  cif['_pdbx_data_usage.name'] = ['CC-BY-4.0', '?']

  # Structure author details.
  cif['_audit_author.name'] = []
  cif['_audit_author.pdbx_ordinal'] = []
  for author_index, author_name in enumerate(_MMCIF_AUTHORS, start=1):
    cif['_audit_author.name'].append(author_name)
    cif['_audit_author.pdbx_ordinal'].append(str(author_index))

  # Paper author details.
  cif['_citation_author.citation_id'] = []
  cif['_citation_author.name'] = []
  cif['_citation_author.ordinal'] = []
  for author_index, author_name in enumerate(_MMCIF_PAPER_AUTHORS, start=1):
    cif['_citation_author.citation_id'].append('primary')
    cif['_citation_author.name'].append(author_name)
    cif['_citation_author.ordinal'].append(str(author_index))

  # Paper citation details.
  cif['_citation.id'] = ['primary']
  cif['_citation.title'] = [
      'Highly accurate protein structure prediction with AlphaFold'
  ]
  cif['_citation.journal_full'] = ['Nature']
  cif['_citation.journal_volume'] = ['596']
  cif['_citation.page_first'] = ['583']
  cif['_citation.page_last'] = ['589']
  cif['_citation.year'] = ['2021']
  cif['_citation.journal_id_ASTM'] = ['NATUAS']
  cif['_citation.country'] = ['UK']
  cif['_citation.journal_id_ISSN'] = ['0028-0836']
  cif['_citation.journal_id_CSD'] = ['0006']
  cif['_citation.book_publisher'] = ['?']
  cif['_citation.pdbx_database_id_PubMed'] = ['34265844']
  cif['_citation.pdbx_database_id_DOI'] = ['10.1038/s41586-021-03819-2']

  # Type of data in the dataset including data used in the model generation.
  cif['_ma_data.id'] = ['1']
  cif['_ma_data.name'] = ['Model']
  cif['_ma_data.content_type'] = ['model coordinates']

  # Description of number of instances for each entity.
  cif['_ma_target_entity_instance.asym_id'] = old_cif['_struct_asym.id']
  cif['_ma_target_entity_instance.entity_id'] = old_cif[
      '_struct_asym.entity_id'
  ]
  cif['_ma_target_entity_instance.details'] = ['.'] * len(
      cif['_ma_target_entity_instance.entity_id']
  )

  # Details about the target entities.
  cif['_ma_target_entity.entity_id'] = cif[
      '_ma_target_entity_instance.entity_id'
  ]
  cif['_ma_target_entity.data_id'] = ['1'] * len(
      cif['_ma_target_entity.entity_id']
  )
  cif['_ma_target_entity.origin'] = ['.'] * len(
      cif['_ma_target_entity.entity_id']
  )

  # Details of the models being deposited.
  cif['_ma_model_list.ordinal_id'] = ['1']
  cif['_ma_model_list.model_id'] = ['1']
  cif['_ma_model_list.model_group_id'] = ['1']
  cif['_ma_model_list.model_name'] = ['Top ranked model']
  cif['_ma_model_list.model_group_name'] = [
      f'AlphaFold {model_type} v{version.__version__} model'
  ]
  cif['_ma_model_list.data_id'] = ['1']
  cif['_ma_model_list.model_type'] = ['Ab initio model']

  # Software used.
  cif['_software.pdbx_ordinal'] = ['1']
  cif['_software.name'] = ['AlphaFold']
  cif['_software.version'] = [f'v{version.__version__}']
  cif['_software.type'] = ['package']
  cif['_software.description'] = ['Structure prediction']
  cif['_software.classification'] = ['other']
  cif['_software.date'] = ['?']

  # Collection of software into groups.
  cif['_ma_software_group.ordinal_id'] = ['1']
  cif['_ma_software_group.group_id'] = ['1']
  cif['_ma_software_group.software_id'] = ['1']

  # Method description to conform with ModelCIF.
  cif['_ma_protocol_step.ordinal_id'] = ['1', '2', '3']
  cif['_ma_protocol_step.protocol_id'] = ['1', '1', '1']
  cif['_ma_protocol_step.step_id'] = ['1', '2', '3']
  cif['_ma_protocol_step.method_type'] = [
      'coevolution MSA',
      'template search',
      'modeling',
  ]

  # Details of the metrics used to assess model confidence.
  cif['_ma_qa_metric.id'] = ['1', '2']
  cif['_ma_qa_metric.name'] = ['pLDDT', 'pLDDT']
  # Accepted values are distance, energy, normalised score, other, zscore.
  cif['_ma_qa_metric.type'] = ['pLDDT', 'pLDDT']
  cif['_ma_qa_metric.mode'] = ['global', 'local']
  cif['_ma_qa_metric.software_group_id'] = ['1', '1']

  # Global model confidence metric value.
  cif['_ma_qa_metric_global.ordinal_id'] = ['1']
  cif['_ma_qa_metric_global.model_id'] = ['1']
  cif['_ma_qa_metric_global.metric_id'] = ['1']
  global_plddt = np.mean(
      [float(v) for v in old_cif['_atom_site.B_iso_or_equiv']]
  )
  cif['_ma_qa_metric_global.metric_value'] = [f'{global_plddt:.2f}']

  cif['_atom_type.symbol'] = sorted(set(old_cif['_atom_site.type_symbol']))

  return cif
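
One detail of the global metric is easy to miss: it is averaged over the `_atom_site.B_iso_or_equiv` column, which has one entry per atom, so residues with more atoms carry more weight. The sketch below shows this with the standard library only; `global_plddt_string` is a hypothetical name and the two-residue example is made up.

```python
import statistics


def global_plddt_string(b_factor_column):
  """Averages a column of B-factor strings and formats it to 2 decimals."""
  return f'{statistics.fmean(float(v) for v in b_factor_column):.2f}'


# Made-up example: a 4-atom residue at pLDDT 90 and an 8-atom residue at 60.
per_atom_column = ['90.00'] * 4 + ['60.00'] * 8
per_atom_mean = global_plddt_string(per_atom_column)

# For comparison only: one value per residue gives a different average.
per_residue_mean = global_plddt_string(['90.00', '60.00'])
```

Here the per-atom mean is 70.00 while a per-residue mean would be 75.00, so the two should not be compared directly.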
...@@ -13,11 +13,18 @@ ...@@ -13,11 +13,18 @@
# limitations under the License. # limitations under the License.
"""Protein data type.""" """Protein data type."""
import collections
import dataclasses import dataclasses
import functools
import io import io
from typing import Any, Mapping, Optional from typing import Any, Dict, List, Mapping, Optional, Tuple
from alphafold.common import mmcif_metadata
from alphafold.common import residue_constants from alphafold.common import residue_constants
from Bio.PDB import MMCIFParser
from Bio.PDB import PDBParser from Bio.PDB import PDBParser
from Bio.PDB.mmcifio import MMCIFIO
from Bio.PDB.Structure import Structure
import numpy as np import numpy as np
FeatureDict = Mapping[str, np.ndarray] FeatureDict = Mapping[str, np.ndarray]
...@@ -27,6 +34,32 @@ ModelOutput = Mapping[str, Any] # Is a nested dict. ...@@ -27,6 +34,32 @@ ModelOutput = Mapping[str, Any] # Is a nested dict.
PDB_CHAIN_IDS = 'ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789' PDB_CHAIN_IDS = 'ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789'
PDB_MAX_CHAINS = len(PDB_CHAIN_IDS) # := 62. PDB_MAX_CHAINS = len(PDB_CHAIN_IDS) # := 62.
# Data to fill the _chem_comp table when writing mmCIFs.
_CHEM_COMP: Mapping[str, Tuple[Tuple[str, str], ...]] = {
'L-peptide linking': (
('ALA', 'ALANINE'),
('ARG', 'ARGININE'),
('ASN', 'ASPARAGINE'),
('ASP', 'ASPARTIC ACID'),
('CYS', 'CYSTEINE'),
('GLN', 'GLUTAMINE'),
('GLU', 'GLUTAMIC ACID'),
('HIS', 'HISTIDINE'),
('ILE', 'ISOLEUCINE'),
('LEU', 'LEUCINE'),
('LYS', 'LYSINE'),
('MET', 'METHIONINE'),
('PHE', 'PHENYLALANINE'),
('PRO', 'PROLINE'),
('SER', 'SERINE'),
('THR', 'THREONINE'),
('TRP', 'TRYPTOPHAN'),
('TYR', 'TYROSINE'),
('VAL', 'VALINE'),
),
'peptide linking': (('GLY', 'GLYCINE'),),
}
@dataclasses.dataclass(frozen=True) @dataclasses.dataclass(frozen=True)
class Protein: class Protein:
...@@ -63,27 +96,32 @@ class Protein: ...@@ -63,27 +96,32 @@ class Protein:
'because these cannot be written to PDB format.') 'because these cannot be written to PDB format.')
def from_pdb_string(pdb_str: str, chain_id: Optional[str] = None) -> Protein: def _from_bio_structure(
"""Takes a PDB string and constructs a Protein object. structure: Structure, chain_id: Optional[str] = None
) -> Protein:
"""Takes a Biopython structure and creates a `Protein` instance.
WARNING: All non-standard residue types will be converted into UNK. All WARNING: All non-standard residue types will be converted into UNK. All
non-standard atoms will be ignored. non-standard atoms will be ignored.
Args: Args:
pdb_str: The contents of the pdb file structure: Structure from the Biopython library.
chain_id: If chain_id is specified (e.g. A), then only that chain chain_id: If chain_id is specified (e.g. A), then only that chain is parsed.
is parsed. Otherwise all chains are parsed. Otherwise all chains are parsed.
Returns: Returns:
A new `Protein` parsed from the pdb contents. A new `Protein` created from the structure contents.
Raises:
ValueError: If the number of models included in the structure is not 1.
ValueError: If insertion code is detected at a residue.
""" """
pdb_fh = io.StringIO(pdb_str)
parser = PDBParser(QUIET=True)
structure = parser.get_structure('none', pdb_fh)
models = list(structure.get_models()) models = list(structure.get_models())
if len(models) != 1: if len(models) != 1:
raise ValueError( raise ValueError(
f'Only single model PDBs are supported. Found {len(models)} models.') 'Only single model PDBs/mmCIFs are supported. Found'
f' {len(models)} models.'
)
model = models[0] model = models[0]
atom_positions = [] atom_positions = []
...@@ -99,8 +137,9 @@ def from_pdb_string(pdb_str: str, chain_id: Optional[str] = None) -> Protein: ...@@ -99,8 +137,9 @@ def from_pdb_string(pdb_str: str, chain_id: Optional[str] = None) -> Protein:
for res in chain: for res in chain:
if res.id[2] != ' ': if res.id[2] != ' ':
raise ValueError( raise ValueError(
f'PDB contains an insertion code at chain {chain.id} and residue ' f'PDB/mmCIF contains an insertion code at chain {chain.id} and'
f'index {res.id[1]}. These are not supported.') f' residue index {res.id[1]}. These are not supported.'
)
res_shortname = residue_constants.restype_3to1.get(res.resname, 'X') res_shortname = residue_constants.restype_3to1.get(res.resname, 'X')
restype_idx = residue_constants.restype_order.get( restype_idx = residue_constants.restype_order.get(
res_shortname, residue_constants.restype_num) res_shortname, residue_constants.restype_num)
...@@ -137,6 +176,48 @@ def from_pdb_string(pdb_str: str, chain_id: Optional[str] = None) -> Protein: ...@@ -137,6 +176,48 @@ def from_pdb_string(pdb_str: str, chain_id: Optional[str] = None) -> Protein:
b_factors=np.array(b_factors)) b_factors=np.array(b_factors))
def from_pdb_string(pdb_str: str, chain_id: Optional[str] = None) -> Protein:
  """Takes a PDB string and constructs a `Protein` object.

  WARNING: All non-standard residue types will be converted into UNK. All
    non-standard atoms will be ignored.

  Args:
    pdb_str: The contents of the PDB file.
    chain_id: If chain_id is specified (e.g. A), then only that chain is parsed.
      Otherwise all chains are parsed.

  Returns:
    A new `Protein` parsed from the PDB contents.
  """
  with io.StringIO(pdb_str) as pdb_fh:
    parser = PDBParser(QUIET=True)
    structure = parser.get_structure(id='none', file=pdb_fh)
    return _from_bio_structure(structure, chain_id)


def from_mmcif_string(
    mmcif_str: str, chain_id: Optional[str] = None
) -> Protein:
  """Takes an mmCIF string and constructs a `Protein` object.

  WARNING: All non-standard residue types will be converted into UNK. All
    non-standard atoms will be ignored.

  Args:
    mmcif_str: The contents of the mmCIF file.
    chain_id: If chain_id is specified (e.g. A), then only that chain is parsed.
      Otherwise all chains are parsed.

  Returns:
    A new `Protein` parsed from the mmCIF contents.
  """
  with io.StringIO(mmcif_str) as mmcif_fh:
    parser = MMCIFParser(QUIET=True)
    structure = parser.get_structure(structure_id='none', filename=mmcif_fh)
    return _from_bio_structure(structure, chain_id)
def _chain_end(atom_index, end_resname, chain_name, residue_index) -> str: def _chain_end(atom_index, end_resname, chain_name, residue_index) -> str:
chain_end = 'TER' chain_end = 'TER'
return (f'{chain_end:<6}{atom_index:>5} {end_resname:>3} ' return (f'{chain_end:<6}{atom_index:>5} {end_resname:>3} '
...@@ -276,3 +357,223 @@ def from_prediction( ...@@ -276,3 +357,223 @@ def from_prediction(
residue_index=_maybe_remove_leading_dim(features['residue_index']) + 1, residue_index=_maybe_remove_leading_dim(features['residue_index']) + 1,
chain_index=chain_index, chain_index=chain_index,
b_factors=b_factors) b_factors=b_factors)
def to_mmcif(
    prot: Protein,
    file_id: str,
    model_type: str,
) -> str:
  """Converts a `Protein` instance to an mmCIF string.

  WARNING 1: The _entity_poly_seq is filled with unknown (UNK) residues for any
    missing residue indices in the range from min(1, min(residue_index)) to
    max(residue_index). E.g. for a protein object with positions for residues
    2 (MET), 3 (LYS), 6 (GLY), this method would set the _entity_poly_seq to:
       1 UNK
       2 MET
       3 LYS
       4 UNK
       5 UNK
       6 GLY
    This is done to preserve the residue numbering.

  WARNING 2: Converting a ground truth mmCIF file to Protein and then back to
    mmCIF using this method will convert all non-standard residue types to UNK.
    If you need to preserve these residue types, you need to store more mmCIF
    metadata in the Protein object (e.g. all fields except for the _atom_site
    loop).

  WARNING 3: Converting a ground truth mmCIF file to Protein and then back to
    mmCIF using this method will not retain the original chain indices.

  WARNING 4: In case of multiple identical chains, they are assigned different
    `_atom_site.label_entity_id` values.

  Args:
    prot: A protein to convert to mmCIF string.
    file_id: The file ID (usually the PDB ID) to be used in the mmCIF.
    model_type: 'Multimer' or 'Monomer'.

  Returns:
    A valid mmCIF string.

  Raises:
    ValueError: If the amino acid types array contains entries with too many
      protein types.
  """
  atom_mask = prot.atom_mask
  aatype = prot.aatype
  atom_positions = prot.atom_positions
  residue_index = prot.residue_index.astype(np.int32)
  chain_index = prot.chain_index.astype(np.int32)
  b_factors = prot.b_factors

  # Construct a mapping from chain integer indices to chain ID strings.
  chain_ids = {}
  # We count unknown residues as protein residues.
  for entity_id in np.unique(chain_index):  # np.unique gives sorted output.
    chain_ids[entity_id] = _int_id_to_str_id(entity_id + 1)

  mmcif_dict = collections.defaultdict(list)

  mmcif_dict['data_'] = file_id.upper()
  mmcif_dict['_entry.id'] = file_id.upper()

  label_asym_id_to_entity_id = {}
  # Entity and chain information.
  for entity_id, chain_id in chain_ids.items():
    # Add all chain information to the _struct_asym table.
    label_asym_id_to_entity_id[str(chain_id)] = str(entity_id)
    mmcif_dict['_struct_asym.id'].append(chain_id)
    mmcif_dict['_struct_asym.entity_id'].append(str(entity_id))
    # Add information about the entity to the _entity_poly table.
    mmcif_dict['_entity_poly.entity_id'].append(str(entity_id))
    mmcif_dict['_entity_poly.type'].append(residue_constants.PROTEIN_CHAIN)
    mmcif_dict['_entity_poly.pdbx_strand_id'].append(chain_id)
    # Generate the _entity table.
    mmcif_dict['_entity.id'].append(str(entity_id))
    mmcif_dict['_entity.type'].append(residue_constants.POLYMER_CHAIN)

  # Add the residues to the _entity_poly_seq table.
  for entity_id, (res_ids, aas) in _get_entity_poly_seq(
      aatype, residue_index, chain_index
  ).items():
    for res_id, aa in zip(res_ids, aas):
      mmcif_dict['_entity_poly_seq.entity_id'].append(str(entity_id))
      mmcif_dict['_entity_poly_seq.num'].append(str(res_id))
      mmcif_dict['_entity_poly_seq.mon_id'].append(
          residue_constants.resnames[aa]
      )

  # Populate the chem comp table.
  for chem_type, chem_comp in _CHEM_COMP.items():
    for chem_id, chem_name in chem_comp:
      mmcif_dict['_chem_comp.id'].append(chem_id)
      mmcif_dict['_chem_comp.type'].append(chem_type)
      mmcif_dict['_chem_comp.name'].append(chem_name)

  # Add all atom sites.
  atom_index = 1
  for i in range(aatype.shape[0]):
    res_name_3 = residue_constants.resnames[aatype[i]]
    if aatype[i] <= len(residue_constants.restypes):
      atom_names = residue_constants.atom_types
    else:
      raise ValueError(
          'Amino acid types array contains entries with too many protein types.'
      )
    for atom_name, pos, mask, b_factor in zip(
        atom_names, atom_positions[i], atom_mask[i], b_factors[i]
    ):
      # Skip atoms that are masked out for this residue.
      if mask < 0.5:
        continue
      type_symbol = residue_constants.atom_id_to_type(atom_name)

      mmcif_dict['_atom_site.group_PDB'].append('ATOM')
      mmcif_dict['_atom_site.id'].append(str(atom_index))
      mmcif_dict['_atom_site.type_symbol'].append(type_symbol)
      mmcif_dict['_atom_site.label_atom_id'].append(atom_name)
      mmcif_dict['_atom_site.label_alt_id'].append('.')
      mmcif_dict['_atom_site.label_comp_id'].append(res_name_3)
      mmcif_dict['_atom_site.label_asym_id'].append(chain_ids[chain_index[i]])
      mmcif_dict['_atom_site.label_entity_id'].append(
          label_asym_id_to_entity_id[chain_ids[chain_index[i]]]
      )
      mmcif_dict['_atom_site.label_seq_id'].append(str(residue_index[i]))
      mmcif_dict['_atom_site.pdbx_PDB_ins_code'].append('.')
      mmcif_dict['_atom_site.Cartn_x'].append(f'{pos[0]:.3f}')
      mmcif_dict['_atom_site.Cartn_y'].append(f'{pos[1]:.3f}')
      mmcif_dict['_atom_site.Cartn_z'].append(f'{pos[2]:.3f}')
      mmcif_dict['_atom_site.occupancy'].append('1.00')
      mmcif_dict['_atom_site.B_iso_or_equiv'].append(f'{b_factor:.2f}')
      mmcif_dict['_atom_site.auth_seq_id'].append(str(residue_index[i]))
      mmcif_dict['_atom_site.auth_asym_id'].append(chain_ids[chain_index[i]])
      mmcif_dict['_atom_site.pdbx_PDB_model_num'].append('1')

      atom_index += 1

  metadata_dict = mmcif_metadata.add_metadata_to_mmcif(mmcif_dict, model_type)
  mmcif_dict.update(metadata_dict)

  return _create_mmcif_string(mmcif_dict)
@functools.lru_cache(maxsize=256)
def _int_id_to_str_id(num: int) -> str:
  """Encodes a number as a string, using reverse spreadsheet style naming.

  Args:
    num: A positive integer.

  Returns:
    A string that encodes the positive integer using reverse spreadsheet style
    naming, e.g. 1 = A, 2 = B, ..., 27 = AA, 28 = BA, 29 = CA, ... This is the
    usual way to encode chain IDs in mmCIF files.

  Raises:
    ValueError: If `num` is not a positive integer.
  """
  if num <= 0:
    raise ValueError(f'Only positive integers allowed, got {num}.')

  num = num - 1  # 1-based indexing.
  output = []
  while num >= 0:
    output.append(chr(num % 26 + ord('A')))
    num = num // 26 - 1
  return ''.join(output)
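
The "reverse" in reverse spreadsheet naming means the first letter is the least significant digit, so 28 is `BA` rather than `AB`. The sketch below restates the encoder in a self-contained form and pairs it with a decoder to check the round trip. `encode_chain_id` and `decode_chain_id` are hypothetical names, and the decoder is an addition for illustration that does not exist in the diff.

```python
def encode_chain_id(num):
  """Encodes a positive integer as a reverse spreadsheet-style chain ID."""
  if num <= 0:
    raise ValueError(f'Only positive integers allowed, got {num}.')
  num -= 1  # Shift to 0-based so that 1 maps to 'A'.
  letters = []
  while num >= 0:
    letters.append(chr(num % 26 + ord('A')))
    num = num // 26 - 1
  return ''.join(letters)


def decode_chain_id(chain_id):
  """Hypothetical inverse: the first letter is the least significant digit."""
  return sum(
      (ord(letter) - ord('A') + 1) * 26**position
      for position, letter in enumerate(chain_id)
  )


examples = {n: encode_chain_id(n) for n in (1, 26, 27, 28, 53, 703)}
```

In `examples`, 27 maps to `AA`, 28 to `BA` and 53 to `AB`, which is the opposite letter order from spreadsheet column names.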
def _get_entity_poly_seq(
    aatypes: np.ndarray, residue_indices: np.ndarray, chain_indices: np.ndarray
) -> Dict[int, Tuple[List[int], List[int]]]:
  """Constructs gapless residue index and aatype lists for each chain.

  Args:
    aatypes: A numpy array with aatypes.
    residue_indices: A numpy array with residue indices.
    chain_indices: A numpy array with chain indices.

  Returns:
    A dictionary mapping chain indices to a tuple with list of residue indices
    and a list of aatypes. Missing residues are filled with UNK residue type.
  """
  if (
      aatypes.shape[0] != residue_indices.shape[0]
      or aatypes.shape[0] != chain_indices.shape[0]
  ):
    raise ValueError(
        'aatypes, residue_indices, chain_indices must have the same length.'
    )

  # Group the present residues by chain index.
  present = collections.defaultdict(list)
  for chain_id, res_id, aa in zip(chain_indices, residue_indices, aatypes):
    present[chain_id].append((res_id, aa))

  # Add any missing residues (from 1 to the first residue and for any gaps).
  entity_poly_seq = {}
  for chain_id, present_residues in present.items():
    present_residue_indices = set([x[0] for x in present_residues])
    min_res_id = min(present_residue_indices)  # Could be negative.
    max_res_id = max(present_residue_indices)

    new_residue_indices = []
    new_aatypes = []
    present_index = 0
    for i in range(min(1, min_res_id), max_res_id + 1):
      new_residue_indices.append(i)
      if i in present_residue_indices:
        new_aatypes.append(present_residues[present_index][1])
        present_index += 1
      else:
        new_aatypes.append(20)  # Unknown amino acid type.
    entity_poly_seq[chain_id] = (new_residue_indices, new_aatypes)
  return entity_poly_seq
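
The gap-filling rule for a single chain can be sketched with plain dictionaries. This is an illustration under assumptions, not the function's code: `fill_sequence_gaps` is a hypothetical name, residue names are used instead of integer aatypes, and a dictionary lookup replaces the running `present_index` counter, which relies on residues arriving in ascending order.

```python
UNK = 'UNK'


def fill_sequence_gaps(present):
  """Returns (residue number, residue name) pairs with gaps filled by UNK.

  `present` maps residue numbers to three-letter names for one chain. The
  range starts at min(1, lowest residue number) so numbering is preserved.
  """
  start = min(1, min(present))
  stop = max(present)
  return [(i, present.get(i, UNK)) for i in range(start, stop + 1)]


# The example from the `to_mmcif` docstring: residues 2, 3 and 6 are present.
filled = fill_sequence_gaps({2: 'MET', 3: 'LYS', 6: 'GLY'})

# A chain that starts below 1 keeps its negative numbering.
filled_negative = fill_sequence_gaps({-1: 'ALA', 1: 'GLY'})
```

The first call pads residue 1 and the 4 to 5 gap with UNK; the second shows that a chain starting at -1 is extended through 0 rather than being renumbered.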
def _create_mmcif_string(mmcif_dict: Dict[str, Any]) -> str:
  """Converts mmCIF dictionary into mmCIF string."""
  mmcifio = MMCIFIO()
  mmcifio.set_dict(mmcif_dict)

  with io.StringIO() as file_handle:
    mmcifio.save(file_handle)
    return file_handle.getvalue()
...@@ -82,16 +82,55 @@ class ProteinTest(parameterized.TestCase): ...@@ -82,16 +82,55 @@ class ProteinTest(parameterized.TestCase):
np.testing.assert_array_almost_equal( np.testing.assert_array_almost_equal(
prot_reconstr.b_factors, prot.b_factors) prot_reconstr.b_factors, prot.b_factors)
  @parameterized.named_parameters(
      dict(
          testcase_name='glucagon',
          pdb_file='glucagon.pdb',
          model_type='Monomer',
      ),
      dict(testcase_name='5nmu', pdb_file='5nmu.pdb', model_type='Multimer'),
  )
  def test_to_mmcif(self, pdb_file, model_type):
    with open(
        os.path.join(
            absltest.get_default_test_srcdir(), TEST_DATA_DIR, pdb_file
        )
    ) as f:
      pdb_string = f.read()
    prot = protein.from_pdb_string(pdb_string)

    file_id = 'test'
    mmcif_string = protein.to_mmcif(prot, file_id, model_type)
    prot_reconstr = protein.from_mmcif_string(mmcif_string)

    np.testing.assert_array_equal(prot_reconstr.aatype, prot.aatype)
    np.testing.assert_array_almost_equal(
        prot_reconstr.atom_positions, prot.atom_positions
    )
    np.testing.assert_array_almost_equal(
        prot_reconstr.atom_mask, prot.atom_mask
    )
    np.testing.assert_array_equal(
        prot_reconstr.residue_index, prot.residue_index
    )
    np.testing.assert_array_equal(prot_reconstr.chain_index, prot.chain_index)
    np.testing.assert_array_almost_equal(
        prot_reconstr.b_factors, prot.b_factors
    )
def test_ideal_atom_mask(self): def test_ideal_atom_mask(self):
with open( with open(
os.path.join(absltest.get_default_test_srcdir(), TEST_DATA_DIR, os.path.join(
'2rbg.pdb')) as f: absltest.get_default_test_srcdir(), TEST_DATA_DIR, '2rbg.pdb'
)
) as f:
pdb_string = f.read() pdb_string = f.read()
prot = protein.from_pdb_string(pdb_string) prot = protein.from_pdb_string(pdb_string)
ideal_mask = protein.ideal_atom_mask(prot) ideal_mask = protein.ideal_atom_mask(prot)
non_ideal_residues = set([102] + list(range(127, 286))) non_ideal_residues = set([102] + list(range(127, 286)))
for i, (res, atom_mask) in enumerate( for i, (res, atom_mask) in enumerate(
zip(prot.residue_index, prot.atom_mask)): zip(prot.residue_index, prot.atom_mask)
):
if res in non_ideal_residues: if res in non_ideal_residues:
self.assertFalse(np.all(atom_mask == ideal_mask[i]), msg=f'{res}') self.assertFalse(np.all(atom_mask == ideal_mask[i]), msg=f'{res}')
else: else:
......
@@ -17,7 +17,7 @@
 import collections
 import functools
 import os
-from typing import List, Mapping, Tuple
+from typing import Final, List, Mapping, Tuple

 import numpy as np
 import tree
@@ -609,6 +609,35 @@ restype_1to3 = {
     'V': 'VAL',
 }

+PROTEIN_CHAIN: Final[str] = 'polypeptide(L)'
+POLYMER_CHAIN: Final[str] = 'polymer'
+
+
+def atom_id_to_type(atom_id: str) -> str:
+  """Convert atom ID to atom type, works only for standard protein residues.
+
+  Args:
+    atom_id: Atom ID to be converted.
+
+  Returns:
+    String corresponding to atom type.
+
+  Raises:
+    ValueError: If atom ID not recognized.
+  """
+  if atom_id.startswith('C'):
+    return 'C'
+  elif atom_id.startswith('N'):
+    return 'N'
+  elif atom_id.startswith('O'):
+    return 'O'
+  elif atom_id.startswith('H'):
+    return 'H'
+  elif atom_id.startswith('S'):
+    return 'S'
+  raise ValueError('Atom ID not recognized.')
+
+
 # NB: restype_3to1 differs from Bio.PDB.protein_letters_3to1 by being a simple
 # 1-to-1 mapping of 3 letter names to one letter names. The latter contains
......
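The prefix rule in the `atom_id_to_type` helper added above is easy to exercise on its own. The sketch below restates the same rule as a loop, so that it runs without the rest of the module, and applies it to a few atom names from the glucagon test file. It is an illustration, not a replacement for the helper.

```python
def atom_id_to_type(atom_id: str) -> str:
  """Maps an atom ID to its element by first letter, as in the hunk above."""
  for element in ('C', 'N', 'O', 'H', 'S'):
    if atom_id.startswith(element):
      return element
  raise ValueError('Atom ID not recognized.')


# Atom names taken from standard protein residues.
atom_types = [atom_id_to_type(a) for a in ('CA', 'NE2', 'OXT', 'SD')]
print(atom_types)

# An ID that starts with none of the five letters is rejected.
try:
  atom_id_to_type('FE')
except ValueError as err:
  print(err)
```

Because only the first letter is inspected, a non-protein ID such as `'CL'` would be reported as carbon. That matches the docstring's caveat that the helper works only for standard protein residues.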
HEADER HORMONE 17-OCT-77 1GCN
TITLE X-RAY ANALYSIS OF GLUCAGON AND ITS RELATIONSHIP TO RECEPTOR
TITLE 2 BINDING
COMPND MOL_ID: 1;
COMPND 2 MOLECULE: GLUCAGON;
COMPND 3 CHAIN: A;
COMPND 4 ENGINEERED: YES
SOURCE MOL_ID: 1;
SOURCE 2 ORGANISM_SCIENTIFIC: SUS SCROFA;
SOURCE 3 ORGANISM_COMMON: PIG;
SOURCE 4 ORGANISM_TAXID: 9823
KEYWDS HORMONE
EXPDTA X-RAY DIFFRACTION
AUTHOR T.L.BLUNDELL,K.SASAKI,S.DOCKERILL,I.J.TICKLE
REVDAT 6 24-FEB-09 1GCN 1 VERSN
REVDAT 5 30-SEP-83 1GCN 1 REVDAT
REVDAT 4 31-DEC-80 1GCN 1 REMARK
REVDAT 3 22-OCT-79 1GCN 3 ATOM
REVDAT 2 29-AUG-79 1GCN 3 CRYST1
REVDAT 1 28-NOV-77 1GCN 0
JRNL AUTH K.SASAKI,S.DOCKERILL,D.A.ADAMIAK,I.J.TICKLE,
JRNL AUTH 2 T.BLUNDELL
JRNL TITL X-RAY ANALYSIS OF GLUCAGON AND ITS RELATIONSHIP TO
JRNL TITL 2 RECEPTOR BINDING.
JRNL REF NATURE V. 257 751 1975
JRNL REFN ISSN 0028-0836
JRNL PMID 171582
JRNL DOI 10.1038/257751A0
REMARK 1
REMARK 1 REFERENCE 1
REMARK 1 EDIT M.O.DAYHOFF
REMARK 1 REF ATLAS OF PROTEIN SEQUENCE V. 5 125 1976
REMARK 1 REF 2 AND STRUCTURE,SUPPLEMENT 2
REMARK 1 PUBL NATIONAL BIOMEDICAL RESEARCH FOUNDATION, SILVER
REMARK 1 PUBL 2 SPRING,MD.
REMARK 1 REFN ISSN 0-912466-05-7
REMARK 2
REMARK 2 RESOLUTION. 3.00 ANGSTROMS.
REMARK 3
REMARK 3 REFINEMENT.
REMARK 3 PROGRAM : NULL
REMARK 3 AUTHORS : NULL
REMARK 3
REMARK 3 DATA USED IN REFINEMENT.
REMARK 3 RESOLUTION RANGE HIGH (ANGSTROMS) : 3.00
REMARK 3 RESOLUTION RANGE LOW (ANGSTROMS) : NULL
REMARK 3 DATA CUTOFF (SIGMA(F)) : NULL
REMARK 3 DATA CUTOFF HIGH (ABS(F)) : NULL
REMARK 3 DATA CUTOFF LOW (ABS(F)) : NULL
REMARK 3 COMPLETENESS (WORKING+TEST) (%) : NULL
REMARK 3 NUMBER OF REFLECTIONS : NULL
REMARK 3
REMARK 3 FIT TO DATA USED IN REFINEMENT.
REMARK 3 CROSS-VALIDATION METHOD : NULL
REMARK 3 FREE R VALUE TEST SET SELECTION : NULL
REMARK 3 R VALUE (WORKING SET) : NULL
REMARK 3 FREE R VALUE : NULL
REMARK 3 FREE R VALUE TEST SET SIZE (%) : NULL
REMARK 3 FREE R VALUE TEST SET COUNT : NULL
REMARK 3 ESTIMATED ERROR OF FREE R VALUE : NULL
REMARK 3
REMARK 3 FIT IN THE HIGHEST RESOLUTION BIN.
REMARK 3 TOTAL NUMBER OF BINS USED : NULL
REMARK 3 BIN RESOLUTION RANGE HIGH (A) : NULL
REMARK 3 BIN RESOLUTION RANGE LOW (A) : NULL
REMARK 3 BIN COMPLETENESS (WORKING+TEST) (%) : NULL
REMARK 3 REFLECTIONS IN BIN (WORKING SET) : NULL
REMARK 3 BIN R VALUE (WORKING SET) : NULL
REMARK 3 BIN FREE R VALUE : NULL
REMARK 3 BIN FREE R VALUE TEST SET SIZE (%) : NULL
REMARK 3 BIN FREE R VALUE TEST SET COUNT : NULL
REMARK 3 ESTIMATED ERROR OF BIN FREE R VALUE : NULL
REMARK 3
REMARK 3 NUMBER OF NON-HYDROGEN ATOMS USED IN REFINEMENT.
REMARK 3 PROTEIN ATOMS : 246
REMARK 3 NUCLEIC ACID ATOMS : 0
REMARK 3 HETEROGEN ATOMS : 0
REMARK 3 SOLVENT ATOMS : 0
REMARK 3
REMARK 3 B VALUES.
REMARK 3 FROM WILSON PLOT (A**2) : NULL
REMARK 3 MEAN B VALUE (OVERALL, A**2) : NULL
REMARK 3 OVERALL ANISOTROPIC B VALUE.
REMARK 3 B11 (A**2) : NULL
REMARK 3 B22 (A**2) : NULL
REMARK 3 B33 (A**2) : NULL
REMARK 3 B12 (A**2) : NULL
REMARK 3 B13 (A**2) : NULL
REMARK 3 B23 (A**2) : NULL
REMARK 3
REMARK 3 ESTIMATED COORDINATE ERROR.
REMARK 3 ESD FROM LUZZATI PLOT (A) : NULL
REMARK 3 ESD FROM SIGMAA (A) : NULL
REMARK 3 LOW RESOLUTION CUTOFF (A) : NULL
REMARK 3
REMARK 3 CROSS-VALIDATED ESTIMATED COORDINATE ERROR.
REMARK 3 ESD FROM C-V LUZZATI PLOT (A) : NULL
REMARK 3 ESD FROM C-V SIGMAA (A) : NULL
REMARK 3
REMARK 3 RMS DEVIATIONS FROM IDEAL VALUES.
REMARK 3 BOND LENGTHS (A) : NULL
REMARK 3 BOND ANGLES (DEGREES) : NULL
REMARK 3 DIHEDRAL ANGLES (DEGREES) : NULL
REMARK 3 IMPROPER ANGLES (DEGREES) : NULL
REMARK 3
REMARK 3 ISOTROPIC THERMAL MODEL : NULL
REMARK 3
REMARK 3 ISOTROPIC THERMAL FACTOR RESTRAINTS. RMS SIGMA
REMARK 3 MAIN-CHAIN BOND (A**2) : NULL ; NULL
REMARK 3 MAIN-CHAIN ANGLE (A**2) : NULL ; NULL
REMARK 3 SIDE-CHAIN BOND (A**2) : NULL ; NULL
REMARK 3 SIDE-CHAIN ANGLE (A**2) : NULL ; NULL
REMARK 3
REMARK 3 NCS MODEL : NULL
REMARK 3
REMARK 3 NCS RESTRAINTS. RMS SIGMA/WEIGHT
REMARK 3 GROUP 1 POSITIONAL (A) : NULL ; NULL
REMARK 3 GROUP 1 B-FACTOR (A**2) : NULL ; NULL
REMARK 3
REMARK 3 PARAMETER FILE 1 : NULL
REMARK 3 TOPOLOGY FILE 1 : NULL
REMARK 3
REMARK 3 OTHER REFINEMENT REMARKS: NULL
REMARK 4
REMARK 4 1GCN COMPLIES WITH FORMAT V. 3.15, 01-DEC-08
REMARK 100
REMARK 100 THIS ENTRY HAS BEEN PROCESSED BY BNL.
REMARK 200
REMARK 200 EXPERIMENTAL DETAILS
REMARK 200 EXPERIMENT TYPE : X-RAY DIFFRACTION
REMARK 200 DATE OF DATA COLLECTION : NULL
REMARK 200 TEMPERATURE (KELVIN) : NULL
REMARK 200 PH : NULL
REMARK 200 NUMBER OF CRYSTALS USED : NULL
REMARK 200
REMARK 200 SYNCHROTRON (Y/N) : NULL
REMARK 200 RADIATION SOURCE : NULL
REMARK 200 BEAMLINE : NULL
REMARK 200 X-RAY GENERATOR MODEL : NULL
REMARK 200 MONOCHROMATIC OR LAUE (M/L) : NULL
REMARK 200 WAVELENGTH OR RANGE (A) : NULL
REMARK 200 MONOCHROMATOR : NULL
REMARK 200 OPTICS : NULL
REMARK 200
REMARK 200 DETECTOR TYPE : NULL
REMARK 200 DETECTOR MANUFACTURER : NULL
REMARK 200 INTENSITY-INTEGRATION SOFTWARE : NULL
REMARK 200 DATA SCALING SOFTWARE : NULL
REMARK 200
REMARK 200 NUMBER OF UNIQUE REFLECTIONS : NULL
REMARK 200 RESOLUTION RANGE HIGH (A) : NULL
REMARK 200 RESOLUTION RANGE LOW (A) : NULL
REMARK 200 REJECTION CRITERIA (SIGMA(I)) : NULL
REMARK 200
REMARK 200 OVERALL.
REMARK 200 COMPLETENESS FOR RANGE (%) : NULL
REMARK 200 DATA REDUNDANCY : NULL
REMARK 200 R MERGE (I) : NULL
REMARK 200 R SYM (I) : NULL
REMARK 200 <I/SIGMA(I)> FOR THE DATA SET : NULL
REMARK 200
REMARK 200 IN THE HIGHEST RESOLUTION SHELL.
REMARK 200 HIGHEST RESOLUTION SHELL, RANGE HIGH (A) : NULL
REMARK 200 HIGHEST RESOLUTION SHELL, RANGE LOW (A) : NULL
REMARK 200 COMPLETENESS FOR SHELL (%) : NULL
REMARK 200 DATA REDUNDANCY IN SHELL : NULL
REMARK 200 R MERGE FOR SHELL (I) : NULL
REMARK 200 R SYM FOR SHELL (I) : NULL
REMARK 200 <I/SIGMA(I)> FOR SHELL : NULL
REMARK 200
REMARK 200 DIFFRACTION PROTOCOL: NULL
REMARK 200 METHOD USED TO DETERMINE THE STRUCTURE: NULL
REMARK 200 SOFTWARE USED: NULL
REMARK 200 STARTING MODEL: NULL
REMARK 200
REMARK 200 REMARK: NULL
REMARK 280
REMARK 280 CRYSTAL
REMARK 280 SOLVENT CONTENT, VS (%): 50.74
REMARK 280 MATTHEWS COEFFICIENT, VM (ANGSTROMS**3/DA): 2.50
REMARK 280
REMARK 280 CRYSTALLIZATION CONDITIONS: NULL
REMARK 290
REMARK 290 CRYSTALLOGRAPHIC SYMMETRY
REMARK 290 SYMMETRY OPERATORS FOR SPACE GROUP: P 21 3
REMARK 290
REMARK 290 SYMOP SYMMETRY
REMARK 290 NNNMMM OPERATOR
REMARK 290 1555 X,Y,Z
REMARK 290 2555 -X+1/2,-Y,Z+1/2
REMARK 290 3555 -X,Y+1/2,-Z+1/2
REMARK 290 4555 X+1/2,-Y+1/2,-Z
REMARK 290 5555 Z,X,Y
REMARK 290 6555 Z+1/2,-X+1/2,-Y
REMARK 290 7555 -Z+1/2,-X,Y+1/2
REMARK 290 8555 -Z,X+1/2,-Y+1/2
REMARK 290 9555 Y,Z,X
REMARK 290 10555 -Y,Z+1/2,-X+1/2
REMARK 290 11555 Y+1/2,-Z+1/2,-X
REMARK 290 12555 -Y+1/2,-Z,X+1/2
REMARK 290
REMARK 290 WHERE NNN -> OPERATOR NUMBER
REMARK 290 MMM -> TRANSLATION VECTOR
REMARK 290
REMARK 290 CRYSTALLOGRAPHIC SYMMETRY TRANSFORMATIONS
REMARK 290 THE FOLLOWING TRANSFORMATIONS OPERATE ON THE ATOM/HETATM
REMARK 290 RECORDS IN THIS ENTRY TO PRODUCE CRYSTALLOGRAPHICALLY
REMARK 290 RELATED MOLECULES.
REMARK 290 SMTRY1 1 1.000000 0.000000 0.000000 0.00000
REMARK 290 SMTRY2 1 0.000000 1.000000 0.000000 0.00000
REMARK 290 SMTRY3 1 0.000000 0.000000 1.000000 0.00000
REMARK 290 SMTRY1 2 -1.000000 0.000000 0.000000 23.55000
REMARK 290 SMTRY2 2 0.000000 -1.000000 0.000000 0.00000
REMARK 290 SMTRY3 2 0.000000 0.000000 1.000000 23.55000
REMARK 290 SMTRY1 3 -1.000000 0.000000 0.000000 0.00000
REMARK 290 SMTRY2 3 0.000000 1.000000 0.000000 23.55000
REMARK 290 SMTRY3 3 0.000000 0.000000 -1.000000 23.55000
REMARK 290 SMTRY1 4 1.000000 0.000000 0.000000 23.55000
REMARK 290 SMTRY2 4 0.000000 -1.000000 0.000000 23.55000
REMARK 290 SMTRY3 4 0.000000 0.000000 -1.000000 0.00000
REMARK 290 SMTRY1 5 0.000000 0.000000 1.000000 0.00000
REMARK 290 SMTRY2 5 1.000000 0.000000 0.000000 0.00000
REMARK 290 SMTRY3 5 0.000000 1.000000 0.000000 0.00000
REMARK 290 SMTRY1 6 0.000000 0.000000 1.000000 23.55000
REMARK 290 SMTRY2 6 -1.000000 0.000000 0.000000 23.55000
REMARK 290 SMTRY3 6 0.000000 -1.000000 0.000000 0.00000
REMARK 290 SMTRY1 7 0.000000 0.000000 -1.000000 23.55000
REMARK 290 SMTRY2 7 -1.000000 0.000000 0.000000 0.00000
REMARK 290 SMTRY3 7 0.000000 1.000000 0.000000 23.55000
REMARK 290 SMTRY1 8 0.000000 0.000000 -1.000000 0.00000
REMARK 290 SMTRY2 8 1.000000 0.000000 0.000000 23.55000
REMARK 290 SMTRY3 8 0.000000 -1.000000 0.000000 23.55000
REMARK 290 SMTRY1 9 0.000000 1.000000 0.000000 0.00000
REMARK 290 SMTRY2 9 0.000000 0.000000 1.000000 0.00000
REMARK 290 SMTRY3 9 1.000000 0.000000 0.000000 0.00000
REMARK 290 SMTRY1 10 0.000000 -1.000000 0.000000 0.00000
REMARK 290 SMTRY2 10 0.000000 0.000000 1.000000 23.55000
REMARK 290 SMTRY3 10 -1.000000 0.000000 0.000000 23.55000
REMARK 290 SMTRY1 11 0.000000 1.000000 0.000000 23.55000
REMARK 290 SMTRY2 11 0.000000 0.000000 -1.000000 23.55000
REMARK 290 SMTRY3 11 -1.000000 0.000000 0.000000 0.00000
REMARK 290 SMTRY1 12 0.000000 -1.000000 0.000000 23.55000
REMARK 290 SMTRY2 12 0.000000 0.000000 -1.000000 0.00000
REMARK 290 SMTRY3 12 1.000000 0.000000 0.000000 23.55000
REMARK 290
REMARK 290 REMARK: NULL
REMARK 300
REMARK 300 BIOMOLECULE: 1
REMARK 300 SEE REMARK 350 FOR THE AUTHOR PROVIDED AND/OR PROGRAM
REMARK 300 GENERATED ASSEMBLY INFORMATION FOR THE STRUCTURE IN
REMARK 300 THIS ENTRY. THE REMARK MAY ALSO PROVIDE INFORMATION ON
REMARK 300 BURIED SURFACE AREA.
REMARK 350
REMARK 350 COORDINATES FOR A COMPLETE MULTIMER REPRESENTING THE KNOWN
REMARK 350 BIOLOGICALLY SIGNIFICANT OLIGOMERIZATION STATE OF THE
REMARK 350 MOLECULE CAN BE GENERATED BY APPLYING BIOMT TRANSFORMATIONS
REMARK 350 GIVEN BELOW. BOTH NON-CRYSTALLOGRAPHIC AND
REMARK 350 CRYSTALLOGRAPHIC OPERATIONS ARE GIVEN.
REMARK 350
REMARK 350 BIOMOLECULE: 1
REMARK 350 AUTHOR DETERMINED BIOLOGICAL UNIT: MONOMERIC
REMARK 350 APPLY THE FOLLOWING TO CHAINS: A
REMARK 350 BIOMT1 1 1.000000 0.000000 0.000000 0.00000
REMARK 350 BIOMT2 1 0.000000 1.000000 0.000000 0.00000
REMARK 350 BIOMT3 1 0.000000 0.000000 1.000000 0.00000
REMARK 500
REMARK 500 GEOMETRY AND STEREOCHEMISTRY
REMARK 500 SUBTOPIC: COVALENT BOND LENGTHS
REMARK 500
REMARK 500 THE STEREOCHEMICAL PARAMETERS OF THE FOLLOWING RESIDUES
REMARK 500 HAVE VALUES WHICH DEVIATE FROM EXPECTED VALUES BY MORE
REMARK 500 THAN 6*RMSD (M=MODEL NUMBER; RES=RESIDUE NAME; C=CHAIN
REMARK 500 IDENTIFIER; SSEQ=SEQUENCE NUMBER; I=INSERTION CODE).
REMARK 500
REMARK 500 STANDARD TABLE:
REMARK 500 FORMAT: (10X,I3,1X,2(A3,1X,A1,I4,A1,1X,A4,3X),1X,F6.3)
REMARK 500
REMARK 500 EXPECTED VALUES PROTEIN: ENGH AND HUBER, 1999
REMARK 500 EXPECTED VALUES NUCLEIC ACID: CLOWNEY ET AL 1996
REMARK 500
REMARK 500 M RES CSSEQI ATM1 RES CSSEQI ATM2 DEVIATION
REMARK 500 TYR A 10 CZ TYR A 10 OH -0.387
REMARK 500 TRP A 25 CD1 TRP A 25 NE1 0.287
REMARK 500 TRP A 25 NE1 TRP A 25 CE2 0.109
REMARK 500
REMARK 500 REMARK: NULL
REMARK 500
REMARK 500 GEOMETRY AND STEREOCHEMISTRY
REMARK 500 SUBTOPIC: COVALENT BOND ANGLES
REMARK 500
REMARK 500 THE STEREOCHEMICAL PARAMETERS OF THE FOLLOWING RESIDUES
REMARK 500 HAVE VALUES WHICH DEVIATE FROM EXPECTED VALUES BY MORE
REMARK 500 THAN 6*RMSD (M=MODEL NUMBER; RES=RESIDUE NAME; C=CHAIN
REMARK 500 IDENTIFIER; SSEQ=SEQUENCE NUMBER; I=INSERTION CODE).
REMARK 500
REMARK 500 STANDARD TABLE:
REMARK 500 FORMAT: (10X,I3,1X,A3,1X,A1,I4,A1,3(1X,A4,2X),12X,F5.1)
REMARK 500
REMARK 500 EXPECTED VALUES PROTEIN: ENGH AND HUBER, 1999
REMARK 500 EXPECTED VALUES NUCLEIC ACID: CLOWNEY ET AL 1996
REMARK 500
REMARK 500 M RES CSSEQI ATM1 ATM2 ATM3
REMARK 500 TRP A 25 CG - CD1 - NE1 ANGL. DEV. = 6.7 DEGREES
REMARK 500 TRP A 25 CD1 - NE1 - CE2 ANGL. DEV. = -21.5 DEGREES
REMARK 500 TRP A 25 NE1 - CE2 - CZ2 ANGL. DEV. = -11.0 DEGREES
REMARK 500 TRP A 25 NE1 - CE2 - CD2 ANGL. DEV. = 9.6 DEGREES
REMARK 500
REMARK 500 REMARK: NULL
REMARK 500
REMARK 500 GEOMETRY AND STEREOCHEMISTRY
REMARK 500 SUBTOPIC: TORSION ANGLES
REMARK 500
REMARK 500 TORSION ANGLES OUTSIDE THE EXPECTED RAMACHANDRAN REGIONS:
REMARK 500 (M=MODEL NUMBER; RES=RESIDUE NAME; C=CHAIN IDENTIFIER;
REMARK 500 SSEQ=SEQUENCE NUMBER; I=INSERTION CODE).
REMARK 500
REMARK 500 STANDARD TABLE:
REMARK 500 FORMAT:(10X,I3,1X,A3,1X,A1,I4,A1,4X,F7.2,3X,F7.2)
REMARK 500
REMARK 500 EXPECTED VALUES: GJ KLEYWEGT AND TA JONES (1996). PHI/PSI-
REMARK 500 CHOLOGY: RAMACHANDRAN REVISITED. STRUCTURE 4, 1395 - 1400
REMARK 500
REMARK 500 M RES CSSEQI PSI PHI
REMARK 500 SER A 2 -57.57 -21.14
REMARK 500 THR A 5 54.62 -63.85
REMARK 500 SER A 11 9.62 -51.97
REMARK 500 MET A 27 -93.98 -145.30
REMARK 500 ASN A 28 64.02 15.67
REMARK 500
REMARK 500 REMARK: NULL
REMARK 500
REMARK 500 GEOMETRY AND STEREOCHEMISTRY
REMARK 500 SUBTOPIC: PLANAR GROUPS
REMARK 500
REMARK 500 PLANAR GROUPS IN THE FOLLOWING RESIDUES HAVE A TOTAL
REMARK 500 RMS DISTANCE OF ALL ATOMS FROM THE BEST-FIT PLANE
REMARK 500 BY MORE THAN AN EXPECTED VALUE OF 6*RMSD, WITH AN
REMARK 500 RMSD 0.02 ANGSTROMS, OR AT LEAST ONE ATOM HAS
REMARK 500 AN RMSD GREATER THAN THIS VALUE
REMARK 500 (M=MODEL NUMBER; RES=RESIDUE NAME; C=CHAIN IDENTIFIER;
REMARK 500 SSEQ=SEQUENCE NUMBER; I=INSERTION CODE).
REMARK 500
REMARK 500 M RES CSSEQI RMS TYPE
REMARK 500 ASN A 28 0.08 SIDE_CHAIN
REMARK 500
REMARK 500 REMARK: NULL
REMARK 500
REMARK 500 GEOMETRY AND STEREOCHEMISTRY
REMARK 500 SUBTOPIC: MAIN CHAIN PLANARITY
REMARK 500
REMARK 500 THE FOLLOWING RESIDUES HAVE A PSEUDO PLANARITY
REMARK 500 TORSION, C(I) - CA(I) - N(I+1) - O(I), GREATER
REMARK 500 10.0 DEGREES. (M=MODEL NUMBER; RES=RESIDUE NAME;
REMARK 500 C=CHAIN IDENTIFIER; SSEQ=SEQUENCE NUMBER;
REMARK 500 I=INSERTION CODE).
REMARK 500
REMARK 500 M RES CSSEQI ANGLE
REMARK 500 HIS A 1 19.48
REMARK 500 GLN A 3 -15.78
REMARK 500 GLY A 4 -17.23
REMARK 500 THR A 5 -10.38
REMARK 500 PHE A 6 -12.06
REMARK 500 THR A 7 -14.66
REMARK 500 SER A 11 -15.10
REMARK 500 LYS A 12 14.46
REMARK 500 ALA A 19 -10.92
REMARK 500 GLN A 20 -13.40
REMARK 500 VAL A 23 -15.87
REMARK 500 LEU A 26 -14.56
REMARK 500 MET A 27 -16.22
REMARK 500
REMARK 500 REMARK: NULL
DBREF 1GCN A 1 29 UNP P01274 GLUC_PIG 33 61
SEQRES 1 A 29 HIS SER GLN GLY THR PHE THR SER ASP TYR SER LYS TYR
SEQRES 2 A 29 LEU ASP SER ARG ARG ALA GLN ASP PHE VAL GLN TRP LEU
SEQRES 3 A 29 MET ASN THR
HELIX 1 A PHE A 6 LEU A 26 1 21
CRYST1 47.100 47.100 47.100 90.00 90.00 90.00 P 21 3 12
ORIGX1 0.021231 0.000000 0.000000 0.00000
ORIGX2 0.000000 0.021231 0.000000 0.00000
ORIGX3 0.000000 0.000000 0.021231 0.00000
SCALE1 0.021231 0.000000 0.000000 0.00000
SCALE2 0.000000 0.021231 0.000000 0.00000
SCALE3 0.000000 0.000000 0.021231 0.00000
ATOM 1 N HIS A 1 49.668 24.248 10.436 1.00 25.00 N
ATOM 2 CA HIS A 1 50.197 25.578 10.784 1.00 16.00 C
ATOM 3 C HIS A 1 49.169 26.701 10.917 1.00 16.00 C
ATOM 4 O HIS A 1 48.241 26.524 11.749 1.00 16.00 O
ATOM 5 CB HIS A 1 51.312 26.048 9.843 1.00 16.00 C
ATOM 6 CG HIS A 1 50.958 26.068 8.340 1.00 16.00 C
ATOM 7 ND1 HIS A 1 49.636 26.144 7.860 1.00 16.00 N
ATOM 8 CD2 HIS A 1 51.797 26.043 7.286 1.00 16.00 C
ATOM 9 CE1 HIS A 1 49.691 26.152 6.454 1.00 17.00 C
ATOM 10 NE2 HIS A 1 51.046 26.090 6.098 1.00 17.00 N
ATOM 11 N SER A 2 49.788 27.850 10.784 1.00 16.00 N
ATOM 12 CA SER A 2 49.138 29.147 10.620 1.00 15.00 C
ATOM 13 C SER A 2 47.713 29.006 10.110 1.00 15.00 C
ATOM 14 O SER A 2 46.740 29.251 10.864 1.00 15.00 O
ATOM 15 CB SER A 2 49.875 29.930 9.569 1.00 16.00 C
ATOM 16 OG SER A 2 49.145 31.057 9.176 1.00 19.00 O
ATOM 17 N GLN A 3 47.620 28.367 8.973 1.00 15.00 N
ATOM 18 CA GLN A 3 46.287 28.193 8.308 1.00 14.00 C
ATOM 19 C GLN A 3 45.406 27.172 8.963 1.00 14.00 C
ATOM 20 O GLN A 3 44.198 27.508 9.014 1.00 14.00 O
ATOM 21 CB GLN A 3 46.489 27.963 6.806 1.00 18.00 C
ATOM 22 CG GLN A 3 45.138 27.800 6.111 1.00 21.00 C
ATOM 23 CD GLN A 3 45.304 27.952 4.603 1.00 24.00 C
ATOM 24 OE1 GLN A 3 46.432 28.202 4.112 1.00 24.00 O
ATOM 25 NE2 GLN A 3 44.233 27.647 3.897 1.00 26.00 N
ATOM 26 N GLY A 4 46.014 26.394 9.871 1.00 14.00 N
ATOM 27 CA GLY A 4 45.422 25.287 10.680 1.00 14.00 C
ATOM 28 C GLY A 4 43.892 25.215 10.719 1.00 14.00 C
ATOM 29 O GLY A 4 43.287 26.155 11.288 1.00 14.00 O
ATOM 30 N THR A 5 43.406 23.993 10.767 1.00 14.00 N
ATOM 31 CA THR A 5 42.004 23.642 10.443 1.00 12.00 C
ATOM 32 C THR A 5 40.788 24.146 11.252 1.00 12.00 C
ATOM 33 O THR A 5 39.804 23.384 11.410 1.00 12.00 O
ATOM 34 CB THR A 5 41.934 22.202 9.889 1.00 14.00 C
ATOM 35 OG1 THR A 5 41.080 21.317 10.609 1.00 15.00 O
ATOM 36 CG2 THR A 5 43.317 21.556 9.849 1.00 15.00 C
ATOM 37 N PHE A 6 40.628 25.463 11.441 1.00 12.00 N
ATOM 38 CA PHE A 6 39.381 25.950 12.104 1.00 12.00 C
ATOM 39 C PHE A 6 38.156 25.684 11.232 1.00 12.00 C
ATOM 40 O PHE A 6 37.231 25.002 11.719 1.00 12.00 O
ATOM 41 CB PHE A 6 39.407 27.425 12.584 1.00 12.00 C
ATOM 42 CG PHE A 6 38.187 27.923 13.430 1.00 12.00 C
ATOM 43 CD1 PHE A 6 36.889 27.518 13.163 1.00 12.00 C
ATOM 44 CD2 PHE A 6 38.386 28.862 14.419 1.00 12.00 C
ATOM 45 CE1 PHE A 6 35.813 27.967 13.909 1.00 12.00 C
ATOM 46 CE2 PHE A 6 37.306 29.328 15.177 1.00 12.00 C
ATOM 47 CZ PHE A 6 36.019 28.871 14.928 1.00 12.00 C
ATOM 48 N THR A 7 38.341 25.794 9.956 1.00 12.00 N
ATOM 49 CA THR A 7 37.249 25.666 8.991 1.00 12.00 C
ATOM 50 C THR A 7 36.324 24.452 9.101 1.00 12.00 C
ATOM 51 O THR A 7 35.111 24.637 9.387 1.00 12.00 O
ATOM 52 CB THR A 7 37.884 25.743 7.628 1.00 13.00 C
ATOM 53 OG1 THR A 7 37.940 27.122 7.317 1.00 14.00 O
ATOM 54 CG2 THR A 7 37.073 25.003 6.585 1.00 14.00 C
ATOM 55 N SER A 8 36.964 23.356 9.442 1.00 12.00 N
ATOM 56 CA SER A 8 36.286 22.063 9.486 1.00 12.00 C
ATOM 57 C SER A 8 35.575 21.813 10.813 1.00 11.00 C
ATOM 58 O SER A 8 35.203 20.650 11.111 1.00 10.00 O
ATOM 59 CB SER A 8 37.291 20.958 9.189 1.00 16.00 C
ATOM 60 OG SER A 8 37.917 21.247 7.943 1.00 20.00 O
ATOM 61 N ASP A 9 35.723 22.783 11.694 1.00 10.00 N
ATOM 62 CA ASP A 9 35.004 22.803 12.977 1.00 10.00 C
ATOM 63 C ASP A 9 33.532 23.121 12.749 1.00 10.00 C
ATOM 64 O ASP A 9 32.645 22.360 13.210 1.00 10.00 O
ATOM 65 CB ASP A 9 35.556 23.874 13.919 1.00 11.00 C
ATOM 66 CG ASP A 9 36.280 23.230 15.096 1.00 13.00 C
ATOM 67 OD1 ASP A 9 36.088 22.010 15.324 1.00 16.00 O
ATOM 68 OD2 ASP A 9 36.821 23.974 15.951 1.00 16.00 O
ATOM 69 N TYR A 10 33.316 24.220 12.040 1.00 10.00 N
ATOM 70 CA TYR A 10 31.967 24.742 11.748 1.00 10.00 C
ATOM 71 C TYR A 10 31.203 23.973 10.685 1.00 10.00 C
ATOM 72 O TYR A 10 29.980 23.772 10.885 1.00 10.00 O
ATOM 73 CB TYR A 10 31.951 26.230 11.367 1.00 10.00 C
ATOM 74 CG TYR A 10 30.613 26.678 10.713 1.00 10.00 C
ATOM 75 CD1 TYR A 10 30.563 26.886 9.350 1.00 10.00 C
ATOM 76 CD2 TYR A 10 29.463 26.824 11.461 1.00 10.00 C
ATOM 77 CE1 TYR A 10 29.377 27.275 8.733 1.00 10.00 C
ATOM 78 CE2 TYR A 10 28.272 27.214 10.848 1.00 10.00 C
ATOM 79 CZ TYR A 10 28.226 27.452 9.483 1.00 10.00 C
ATOM 80 OH TYR A 10 27.365 27.683 9.060 1.00 11.00 O
ATOM 81 N SER A 11 31.796 23.909 9.491 1.00 10.00 N
ATOM 82 CA SER A 11 31.146 23.418 8.250 1.00 10.00 C
ATOM 83 C SER A 11 30.463 22.048 8.303 1.00 10.00 C
ATOM 84 O SER A 11 29.615 21.759 7.422 1.00 10.00 O
ATOM 85 CB SER A 11 32.004 23.615 6.998 1.00 14.00 C
ATOM 86 OG SER A 11 32.013 24.995 6.632 1.00 19.00 O
ATOM 87 N LYS A 12 30.402 21.619 9.544 1.00 10.00 N
ATOM 88 CA LYS A 12 29.792 20.460 10.189 1.00 9.00 C
ATOM 89 C LYS A 12 28.494 20.817 10.932 1.00 9.00 C
ATOM 90 O LYS A 12 27.597 19.943 10.980 1.00 9.00 O
ATOM 91 CB LYS A 12 30.811 20.013 11.224 1.00 10.00 C
ATOM 92 CG LYS A 12 30.482 18.661 11.833 1.00 14.00 C
ATOM 93 CD LYS A 12 31.413 18.365 12.999 1.00 18.00 C
ATOM 94 CE LYS A 12 31.243 16.937 13.498 1.00 22.00 C
ATOM 95 NZ LYS A 12 32.121 16.717 14.652 1.00 26.00 N
ATOM 96 N TYR A 13 28.583 21.742 11.894 1.00 9.00 N
ATOM 97 CA TYR A 13 27.396 22.283 12.612 1.00 8.00 C
ATOM 98 C TYR A 13 26.214 22.497 11.670 1.00 8.00 C
ATOM 99 O TYR A 13 25.037 22.245 12.029 1.00 8.00 O
ATOM 100 CB TYR A 13 27.730 23.578 13.385 1.00 8.00 C
ATOM 101 CG TYR A 13 26.516 24.500 13.692 1.00 8.00 C
ATOM 102 CD1 TYR A 13 25.798 24.377 14.867 1.00 8.00 C
ATOM 103 CD2 TYR A 13 26.185 25.498 12.796 1.00 8.00 C
ATOM 104 CE1 TYR A 13 24.713 25.228 15.120 1.00 8.00 C
ATOM 105 CE2 TYR A 13 25.108 26.342 13.035 1.00 8.00 C
ATOM 106 CZ TYR A 13 24.370 26.210 14.196 1.00 8.00 C
ATOM 107 OH TYR A 13 23.202 26.933 14.347 1.00 10.00 O
ATOM 108 N LEU A 14 26.522 22.993 10.494 1.00 8.00 N
ATOM 109 CA LEU A 14 25.461 23.263 9.523 1.00 8.00 C
ATOM 110 C LEU A 14 24.912 21.978 8.907 1.00 8.00 C
ATOM 111 O LEU A 14 24.122 22.025 7.933 1.00 8.00 O
ATOM 112 CB LEU A 14 25.923 24.242 8.447 1.00 13.00 C
ATOM 113 CG LEU A 14 25.064 25.509 8.412 1.00 19.00 C
ATOM 114 CD1 LEU A 14 25.564 26.496 7.505 1.00 25.00 C
ATOM 115 CD2 LEU A 14 23.582 25.209 8.199 1.00 25.00 C
ATOM 116 N ASP A 15 25.556 20.886 9.263 1.00 8.00 N
ATOM 117 CA ASP A 15 25.075 19.552 8.885 1.00 8.00 C
ATOM 118 C ASP A 15 24.208 19.002 10.009 1.00 8.00 C
ATOM 119 O ASP A 15 23.550 17.940 9.861 1.00 8.00 O
ATOM 120 CB ASP A 15 26.246 18.601 8.644 1.00 11.00 C
ATOM 121 CG ASP A 15 26.260 18.121 7.196 1.00 16.00 C
ATOM 122 OD1 ASP A 15 26.021 18.946 6.280 1.00 21.00 O
ATOM 123 OD2 ASP A 15 26.732 16.984 6.946 1.00 21.00 O
ATOM 124 N SER A 16 24.015 19.861 10.986 1.00 8.00 N
ATOM 125 CA SER A 16 23.180 19.548 12.149 1.00 7.00 C
ATOM 126 C SER A 16 21.923 20.414 12.167 1.00 7.00 C
ATOM 127 O SER A 16 20.841 19.941 12.598 1.00 7.00 O
ATOM 128 CB SER A 16 23.981 19.746 13.437 1.00 9.00 C
ATOM 129 OG SER A 16 23.327 19.102 14.524 1.00 11.00 O
ATOM 130 N ARG A 17 22.037 21.605 11.597 1.00 7.00 N
ATOM 131 CA ARG A 17 20.875 22.504 11.583 1.00 6.00 C
ATOM 132 C ARG A 17 19.868 22.156 10.491 1.00 6.00 C
ATOM 133 O ARG A 17 18.665 22.015 10.809 1.00 6.00 O
ATOM 134 CB ARG A 17 21.214 23.997 11.557 1.00 7.00 C
ATOM 135 CG ARG A 17 20.010 24.800 12.063 1.00 9.00 C
ATOM 136 CD ARG A 17 19.570 25.929 11.132 1.00 11.00 C
ATOM 137 NE ARG A 17 20.149 27.218 11.537 1.00 12.00 N
ATOM 138 CZ ARG A 17 19.828 28.351 10.936 1.00 13.00 C
ATOM 139 NH1 ARG A 17 19.319 28.304 9.720 1.00 14.00 N
ATOM 140 NH2 ARG A 17 20.351 29.485 11.362 1.00 14.00 N
ATOM 141 N ARG A 18 20.378 21.725 9.348 1.00 6.00 N
ATOM 142 CA ARG A 18 19.530 21.258 8.235 1.00 5.00 C
ATOM 143 C ARG A 18 19.148 19.796 8.478 1.00 5.00 C
ATOM 144 O ARG A 18 18.326 19.189 7.741 1.00 5.00 O
ATOM 145 CB ARG A 18 20.237 21.481 6.888 1.00 8.00 C
ATOM 146 CG ARG A 18 19.384 21.236 5.634 1.00 9.00 C
ATOM 147 CD ARG A 18 19.623 19.860 5.005 1.00 11.00 C
ATOM 148 NE ARG A 18 20.029 19.997 3.600 1.00 12.00 N
ATOM 149 CZ ARG A 18 19.398 19.415 2.597 1.00 13.00 C
ATOM 150 NH1 ARG A 18 18.483 18.493 2.835 1.00 14.00 N
ATOM 151 NH2 ARG A 18 19.831 19.597 1.364 1.00 14.00 N
ATOM 152 N ALA A 19 19.560 19.319 9.623 1.00 6.00 N
ATOM 153 CA ALA A 19 19.126 17.991 10.053 1.00 6.00 C
ATOM 154 C ALA A 19 18.002 18.136 11.071 1.00 6.00 C
ATOM 155 O ALA A 19 16.933 17.494 10.922 1.00 7.00 O
ATOM 156 CB ALA A 19 20.285 17.187 10.629 1.00 15.00 C
ATOM 157 N GLN A 20 18.094 19.241 11.783 1.00 7.00 N
ATOM 158 CA GLN A 20 17.013 19.632 12.689 1.00 7.00 C
ATOM 159 C GLN A 20 15.897 20.314 11.905 1.00 7.00 C
ATOM 160 O GLN A 20 14.701 20.031 12.162 1.00 7.00 O
ATOM 161 CB GLN A 20 17.513 20.538 13.821 1.00 11.00 C
ATOM 162 CG GLN A 20 16.699 21.829 13.936 1.00 16.00 C
ATOM 163 CD GLN A 20 16.591 22.277 15.393 1.00 22.00 C
ATOM 164 OE1 GLN A 20 17.533 22.060 16.194 1.00 24.00 O
ATOM 165 NE2 GLN A 20 15.356 22.544 15.773 1.00 24.00 N
ATOM 166 N ASP A 21 16.292 20.724 10.714 1.00 7.00 N
ATOM 167 CA ASP A 21 15.405 21.490 9.835 1.00 7.00 C
ATOM 168 C ASP A 21 14.451 20.565 9.120 1.00 7.00 C
ATOM 169 O ASP A 21 13.245 20.850 8.962 1.00 7.00 O
ATOM 170 CB ASP A 21 16.212 22.278 8.809 1.00 14.00 C
ATOM 171 CG ASP A 21 15.427 23.525 8.413 1.00 21.00 C
ATOM 172 OD1 ASP A 21 15.031 24.298 9.321 1.00 28.00 O
ATOM 173 OD2 ASP A 21 15.316 23.827 7.200 1.00 28.00 O
ATOM 174 N PHE A 22 14.987 19.373 8.843 1.00 7.00 N
ATOM 175 CA PHE A 22 14.216 18.253 8.289 1.00 7.00 C
ATOM 176 C PHE A 22 13.098 17.860 9.246 1.00 7.00 C
ATOM 177 O PHE A 22 11.956 17.556 8.818 1.00 7.00 O
ATOM 178 CB PHE A 22 15.134 17.038 8.105 1.00 8.00 C
ATOM 179 CG PHE A 22 14.349 15.761 7.724 1.00 10.00 C
ATOM 180 CD1 PHE A 22 14.022 15.527 6.410 1.00 12.00 C
ATOM 181 CD2 PHE A 22 13.992 14.842 8.689 1.00 12.00 C
ATOM 182 CE1 PHE A 22 13.302 14.391 6.050 1.00 14.00 C
ATOM 183 CE2 PHE A 22 13.269 13.708 8.340 1.00 14.00 C
ATOM 184 CZ PHE A 22 12.917 13.483 7.018 1.00 16.00 C
ATOM 185 N VAL A 23 13.455 17.883 10.517 1.00 7.00 N
ATOM 186 CA VAL A 23 12.574 17.403 11.589 1.00 7.00 C
ATOM 187 C VAL A 23 11.283 18.205 11.729 1.00 7.00 C
ATOM 188 O VAL A 23 10.233 17.600 12.052 1.00 7.00 O
ATOM 189 CB VAL A 23 13.339 17.278 12.906 1.00 10.00 C
ATOM 190 CG1 VAL A 23 12.441 17.004 14.108 1.00 13.00 C
ATOM 191 CG2 VAL A 23 14.455 16.248 12.794 1.00 13.00 C
ATOM 192 N GLN A 24 11.255 19.253 10.941 1.00 8.00 N
ATOM 193 CA GLN A 24 10.082 20.114 10.818 1.00 8.00 C
ATOM 194 C GLN A 24 9.158 19.638 9.692 1.00 8.00 C
ATOM 195 O GLN A 24 7.959 19.990 9.663 1.00 8.00 O
ATOM 196 CB GLN A 24 10.575 21.521 10.498 1.00 14.00 C
ATOM 197 CG GLN A 24 9.505 22.591 10.661 1.00 20.00 C
ATOM 198 CD GLN A 24 9.964 23.862 9.956 1.00 26.00 C
ATOM 199 OE1 GLN A 24 10.079 24.941 10.587 1.00 32.00 O
ATOM 200 NE2 GLN A 24 10.086 23.739 8.649 1.00 32.00 N
ATOM 201 N TRP A 25 9.723 19.074 8.651 1.00 8.00 N
ATOM 202 CA TRP A 25 8.899 18.676 7.495 1.00 9.00 C
ATOM 203 C TRP A 25 8.118 17.395 7.751 1.00 9.00 C
ATOM 204 O TRP A 25 6.860 17.395 7.725 1.00 9.00 O
ATOM 205 CB TRP A 25 9.761 18.442 6.262 1.00 11.00 C
ATOM 206 CG TRP A 25 8.871 18.331 5.004 1.00 12.00 C
ATOM 207 CD1 TRP A 25 8.097 19.279 4.442 1.00 12.00 C
ATOM 208 CD2 TRP A 25 8.640 17.180 4.249 1.00 12.00 C
ATOM 209 NE1 TRP A 25 7.041 18.780 3.259 1.00 12.00 N
ATOM 210 CE2 TRP A 25 7.873 17.564 3.121 1.00 12.00 C
ATOM 211 CE3 TRP A 25 9.124 15.884 4.378 1.00 12.00 C
ATOM 212 CZ2 TRP A 25 7.726 16.765 2.003 1.00 12.00 C
ATOM 213 CZ3 TRP A 25 8.870 15.038 3.296 1.00 12.00 C
ATOM 214 CH2 TRP A 25 8.216 15.469 2.140 1.00 12.00 C
ATOM 215 N LEU A 26 8.857 16.484 8.346 1.00 9.00 N
ATOM 216 CA LEU A 26 8.377 15.159 8.741 1.00 10.00 C
ATOM 217 C LEU A 26 7.534 15.279 10.012 1.00 11.00 C
ATOM 218 O LEU A 26 6.755 14.347 10.331 1.00 11.00 O
ATOM 219 CB LEU A 26 9.611 14.267 8.924 1.00 10.00 C
ATOM 220 CG LEU A 26 9.342 12.810 9.303 1.00 10.00 C
ATOM 221 CD1 LEU A 26 8.223 12.149 8.505 1.00 10.00 C
ATOM 222 CD2 LEU A 26 10.637 11.982 9.250 1.00 10.00 C
ATOM 223 N MET A 27 7.281 16.544 10.320 1.00 11.00 N
ATOM 224 CA MET A 27 6.446 16.959 11.451 1.00 11.00 C
ATOM 225 C MET A 27 5.607 18.227 11.219 1.00 13.00 C
ATOM 226 O MET A 27 4.823 18.240 10.244 1.00 13.00 O
ATOM 227 CB MET A 27 7.327 17.118 12.679 1.00 11.00 C
ATOM 228 CG MET A 27 6.518 17.289 13.953 1.00 11.00 C
ATOM 229 SD MET A 27 7.301 18.326 15.196 1.00 11.00 S
ATOM 230 CE MET A 27 5.833 18.677 16.178 1.00 11.00 C
ATOM 231 N ASN A 28 6.147 19.366 11.620 1.00 14.00 N
ATOM 232 CA ASN A 28 5.399 20.637 11.728 1.00 14.00 C
ATOM 233 C ASN A 28 3.878 20.587 11.716 1.00 17.00 C
ATOM 234 O ASN A 28 3.252 21.114 10.763 1.00 19.00 O
ATOM 235 CB ASN A 28 5.874 21.774 10.843 1.00 14.00 C
ATOM 236 CG ASN A 28 6.246 22.905 11.791 1.00 14.00 C
ATOM 237 OD1 ASN A 28 6.929 22.629 12.807 1.00 14.00 O
ATOM 238 ND2 ASN A 28 6.271 24.085 11.229 1.00 14.00 N
ATOM 239 N THR A 29 3.391 19.940 12.762 1.00 21.00 N
ATOM 240 CA THR A 29 2.014 19.761 13.283 1.00 21.00 C
ATOM 241 C THR A 29 0.826 19.943 12.332 1.00 23.00 C
ATOM 242 O THR A 29 0.932 19.600 11.133 1.00 30.00 O
ATOM 243 CB THR A 29 1.845 20.667 14.505 1.00 21.00 C
ATOM 244 OG1 THR A 29 1.214 21.893 14.153 1.00 21.00 O
ATOM 245 CG2 THR A 29 3.180 20.968 15.185 1.00 21.00 C
ATOM 246 OXT THR A 29 -0.317 20.109 12.824 1.00 25.00 O
TER 247 THR A 29
MASTER 344 1 0 1 0 0 0 6 246 1 0 3
END
@@ -315,6 +315,7 @@ def _get_header(parsed_info: MmCIFDict) -> PdbHeader:
       try:
         raw_resolution = parsed_info[res_key][0]
         header['resolution'] = float(raw_resolution)
+        break
       except ValueError:
         logging.debug('Invalid resolution format: %s', parsed_info[res_key])
......
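The effect of the added `break` is easiest to see with several candidate keys. The sketch below assumes the surrounding code loops over a list of resolution keys, which the hunk does not show. The function name `pick_resolution`, the key order and the sample values are illustrative assumptions.

```python
import logging


def pick_resolution(parsed_info, res_keys):
  """Stores the first resolution value that parses as a float."""
  header = {}
  for res_key in res_keys:
    if res_key in parsed_info:
      try:
        raw_resolution = parsed_info[res_key][0]
        header['resolution'] = float(raw_resolution)
        break  # Stop here so a later key cannot overwrite this value.
      except ValueError:
        logging.debug('Invalid resolution format: %s', parsed_info[res_key])
  return header


# Assumed key order; the first entry is unparsable, the next two are valid.
keys = (
    '_refine.ls_d_res_high',
    '_em_3d_reconstruction.resolution',
    '_reflns.d_resolution_high',
)
info = {
    '_refine.ls_d_res_high': ['?'],
    '_em_3d_reconstruction.resolution': ['2.1'],
    '_reflns.d_resolution_high': ['3.5'],
}
header = pick_resolution(info, keys)
print(header)
```

With the `break`, the result is `{'resolution': 2.1}`. Without it, the loop would carry on and the last parsable key would win, giving 3.5.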
@@ -15,9 +15,7 @@
 """Pairing logic for multimer data pipeline."""

 import collections
-import functools
-import string
-from typing import Any, Dict, Iterable, List, Sequence
+from typing import cast, Dict, Iterable, List, Sequence

 from alphafold.common import residue_constants
 from alphafold.data import pipeline
@@ -135,7 +133,7 @@ def _create_species_dict(msa_df: pd.DataFrame) -> Dict[bytes, pd.DataFrame]:
   """Creates mapping from species to msa dataframe of that species."""
   species_lookup = {}
   for species, species_df in msa_df.groupby('msa_species_identifiers'):
-    species_lookup[species] = species_df
+    species_lookup[cast(bytes, species)] = species_df
   return species_lookup
......
@@ -449,6 +449,7 @@ def _get_atom_positions(
     mask = np.zeros([residue_constants.atom_type_num], dtype=np.float32)
     res_at_position = mmcif_object.seqres_to_structure[auth_chain_id][res_index]
     if not res_at_position.is_missing:
+      assert res_at_position.position is not None
       res = chain[(res_at_position.hetflag,
                    res_at_position.position.residue_number,
                    res_at_position.position.insertion_code)]
......
@@ -775,7 +775,7 @@ def compute_atom14_gt(
   gt_mask = (1. - use_alt) * gt_mask + use_alt * alt_gt_mask
   gt_positions = (1. - use_alt) * gt_positions + use_alt * alt_gt_positions
-  return gt_positions, alt_gt_mask, alt_naming_is_better
+  return gt_positions, gt_mask, alt_naming_is_better
 def backbone_loss(gt_rigid: geometry.Rigid3Array,
......
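The hunk above returns the blended `gt_mask` instead of `alt_gt_mask`. A minimal sketch of why the two differ, using plain Python floats for a single entry rather than arrays; `blend` is a hypothetical helper and the values are made up for illustration:

```python
def blend(use_alt: float, original: float, alternative: float) -> float:
  """Selects `alternative` when use_alt is 1.0 and `original` when it is 0.0."""
  return (1.0 - use_alt) * original + use_alt * alternative


gt_mask, alt_gt_mask = 1.0, 0.0

# The alternative naming is not better here, so the original mask must survive.
blended = blend(use_alt=0.0, original=gt_mask, alternative=alt_gt_mask)
```

Here `blended` keeps the original mask value, whereas returning `alt_gt_mask` directly would discard the selection made on the line before the `return`.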
@@ -61,9 +61,9 @@ def assert_vectors_equal(vec1: vector.Vec3Array, vec2: vector.Vec3Array):
 def assert_vectors_close(vec1: vector.Vec3Array, vec2: vector.Vec3Array):
-  np.testing.assert_allclose(vec1.x, vec2.x, atol=1e-6, rtol=0.)
-  np.testing.assert_allclose(vec1.y, vec2.y, atol=1e-6, rtol=0.)
-  np.testing.assert_allclose(vec1.z, vec2.z, atol=1e-6, rtol=0.)
+  np.testing.assert_allclose(vec1.x, vec2.x, atol=1e-5, rtol=0.)
+  np.testing.assert_allclose(vec1.y, vec2.y, atol=1e-5, rtol=0.)
+  np.testing.assert_allclose(vec1.z, vec2.z, atol=1e-5, rtol=0.)
 def assert_array_close_to_vector(array: jnp.ndarray, vec: vector.Vec3Array):
......
@@ -29,8 +29,7 @@ class PrngTest(absltest.TestCase):
     raw_key = safe_key.get()
-    self.assertNotEqual(raw_key[0], init_key[0])
-    self.assertNotEqual(raw_key[1], init_key[1])
+    self.assertFalse((raw_key == init_key).all())
     with self.assertRaises(RuntimeError):
       safe_key.get()
......
@@ -160,8 +160,14 @@ def padding_consistent_rng(f):
     return jax.vmap(functools.partial(grid_keys, shape=shape[1:]))(new_keys)
   def inner(key, shape, **kwargs):
+    keys = grid_keys(key, shape)
+    signature = (
+        '()->()'
+        if jax.dtypes.issubdtype(keys.dtype, jax.dtypes.prng_key)
+        else '(2)->()'
+    )
     return jnp.vectorize(
-        lambda key: f(key, shape=(), **kwargs),
-        signature='(2)->()')(
-            grid_keys(key, shape))
+        functools.partial(f, shape=(), **kwargs), signature=signature
+    )(keys)
   return inner
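The `signature` string in the hunk above tells `jnp.vectorize` how many trailing dimensions one key occupies: `'(2)->()'` for a raw key stored as two words, `'()->()'` for a scalar typed key. A minimal sketch of the same generalized-ufunc signature using NumPy's `np.vectorize` as a stand-in, so that it runs without JAX; `fake_sample` and the array contents are illustrative assumptions, not part of this change:

```python
import numpy as np


def fake_sample(key: np.ndarray) -> np.int64:
  """Stand-in for a sampler: consumes one length-2 key, returns a scalar."""
  return key[0] * 10 + key[1]


# A 3x4 grid of keys, each key being a vector of two integers.
keys = np.arange(24).reshape(3, 4, 2)

# '(2)->()' maps over the leading dimensions and passes each length-2 slice.
samples = np.vectorize(fake_sample, signature='(2)->()')(keys)
```

The result has one scalar per grid cell, so the trailing key dimension of size 2 is consumed. With scalar keys that dimension does not exist, which is presumably why the new code switches to `'()->()'` in that case.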
@@ -13,7 +13,6 @@
 # limitations under the License.
 """Helper methods for the AlphaFold Colab notebook."""
-import json
 from typing import AbstractSet, Any, Mapping, Optional, Sequence
 from alphafold.common import residue_constants
@@ -143,31 +142,6 @@ def empty_placeholder_template_features(
   }
-def get_pae_json(pae: np.ndarray, max_pae: float) -> str:
-  """Returns the PAE in the same format as is used in the AFDB.
-  Note that the values are presented as floats to 1 decimal place,
-  whereas AFDB returns integer values.
-  Args:
-    pae: The n_res x n_res PAE array.
-    max_pae: The maximum possible PAE value.
-  Returns:
-    PAE output format as a JSON string.
-  """
-  # Check the PAE array is the correct shape.
-  if (pae.ndim != 2 or pae.shape[0] != pae.shape[1]):
-    raise ValueError(f'PAE must be a square matrix, got {pae.shape}')
-  # Round the predicted aligned errors to 1 decimal place.
-  rounded_errors = np.round(pae.astype(np.float64), decimals=1)
-  formatted_output = [{
-      'predicted_aligned_error': rounded_errors.tolist(),
-      'max_predicted_aligned_error': max_pae
-  }]
-  return json.dumps(formatted_output, indent=None, separators=(',', ':'))
 def check_cell_execution_order(
     cells_ran: AbstractSet[int], cell_number: int) -> None:
   """Check that the cell execution order is correct.
......