<!--
* @Author: zhuww
* @email: zhuww@sugon.com
* @Date: 2023-04-06 18:04:07
* @LastEditTime: 2023-08-24 09:34:01
-->
# AF2
## Paper
- [https://www.nature.com/articles/s41586-021-03819-2](https://www.nature.com/articles/s41586-021-03819-2)
## Model Architecture
The core of the model is a Transformer-based neural network with two main components, a sequence-to-sequence model and a structure model; the two are optimized jointly through iterative training to improve prediction accuracy.
![img](./docs/alphafold2.png)
## Algorithm
AlphaFold2 predicts a protein's three-dimensional structure with a neural network that extracts information from protein sequence and structure data.
![img](./docs/alphafold2_1.png)
## Environment Setup
An inference Docker image can be pulled from [SourceFind](https://www.sourcefind.cn/#/service-details):
```
docker pull image.sourcefind.cn:5000/dcu/admin/base/custom:alphafold2-2.2.1-centos7.6-dtk-22.04.2-py38
# <Image ID>: ID of the image pulled above
# <Host Path>: path on the host
# <Container Path>: mount path inside the container
docker run -it --name alphafold --shm-size=32G --device=/dev/kfd --device=/dev/dri/ --cap-add=SYS_PTRACE --security-opt seccomp=unconfined --ulimit memlock=-1:-1 --ipc=host --network host --group-add video -v <Host Path>:<Container Path> <Image ID> /bin/bash
```
Image dependencies:
* DTK driver: dtk22.04.2
* JAX: 0.3.14
* TensorFlow2: 2.10.0
* Python: 3.8
Activate the environment inside the image:
`source /opt/dtk-22.04.2/env.sh`
`source /opt/openmm-hip/env.sh`
Test directory:
`/opt/docker/tests/alphafold`
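To confirm the environment inside the container, you can import the stack and list the visible devices; a minimal check, assuming the image versions above:
```
# Quick sanity check of the container environment (illustrative, not part of the repo).
import jax
import tensorflow as tf

print(jax.__version__)  # expected: 0.3.14
print(tf.__version__)   # expected: 2.10.0
# On DCU, accelerators should show up here via the ROCm backend.
print(jax.devices())
```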
## Datasets
We recommend the open datasets used by AlphaFold2, including BFD, MGnify, PDB70, Uniclust30, and UniRef90; the full set is about 2.2 TB. The directory layout is as follows:
```
$DOWNLOAD_DIR/
bfd/
bfd_metaclust_clu_complete_id30_c90_final_seq.sorted_opt_hhm.ffindex
bfd_metaclust_clu_complete_id30_c90_final_seq.sorted_opt_hhm.ffdata
bfd_metaclust_clu_complete_id30_c90_final_seq.sorted_opt_cs219.ffindex
...
mgnify/
mgy_clusters_2018_12.fa
params/
params_model_1.npz
params_model_2.npz
params_model_3.npz
...
pdb70/
pdb_filter.dat
pdb70_hhm.ffindex
pdb70_hhm.ffdata
...
pdb_mmcif/
mmcif_files/
100d.cif
101d.cif
101m.cif
...
obsolete.dat
pdb_seqres/
pdb_seqres.txt
small_bfd/
bfd-first_non_consensus_sequences.fasta
uniclust30/
uniclust30_2018_08/
uniclust30_2018_08_md5sum
uniclust30_2018_08_hhm_db.index
uniclust30_2018_08_hhm_db
...
uniprot/
uniprot.fasta
uniref90/
uniref90.fasta
```
A script, `download_all_data.sh`, is provided for downloading the datasets and model parameter files:
```
./scripts/download_all_data.sh <download directory>
```
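Before running inference, it can help to verify that the key database files are in place; a minimal check based on the layout above (`DOWNLOAD_DIR` is a placeholder):
```
# Verify that the expected database files exist under DOWNLOAD_DIR (illustrative sketch).
import os

DOWNLOAD_DIR = '/path/to/database'
expected = [
    'params/params_model_1.npz',
    'mgnify/mgy_clusters_2018_12.fa',
    'pdb70/pdb70_hhm.ffdata',
    'pdb_mmcif/obsolete.dat',
    'uniref90/uniref90.fasta',
]
for rel_path in expected:
    path = os.path.join(DOWNLOAD_DIR, rel_path)
    print('ok     ' if os.path.exists(path) else 'MISSING', path)
```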
## Inference
JAX-based inference scripts are provided for both monomer and multimer prediction.
Set the `DOWNLOAD_DIR` and `output_dir` paths in `run_alphafold.py`. Make sure the output directory exists and that you have write permission for it.
```
# Set to the target of scripts/download_all_data.sh
DOWNLOAD_DIR = '/path/to/database'
# Path to a directory that will store the results.
output_dir = '/path/to/output_dir'
```
### Monomer
```
python3 run_alphafold.py \
    --fasta_paths=monomer.fasta \
    --output_dir=./ \
    --max_template_date=2020-05-14 \
    --model_preset=monomer \
    --run_relax=true \
    --use_gpu_relax=true
```
Alternatively, run `./run_monomer.sh`.
#### Monomer inference parameters
* `monomer.fasta`: the monomer sequence to fold.
* `--output_dir`: the output directory.
* `--model_preset`: the model configuration to use.
* `--run_relax=true`: run the final relaxation step.
* `--use_gpu_relax=true`: relax on GPU (faster, but may be less stable); `--use_gpu_relax=false` relaxes on CPU (slower, but stable).
* `--use_precomputed_msas=true`: reuse MSAs that have already been computed; by default, the search and alignment are run from scratch.
### Multimer
```
python3 run_alphafold.py \
    --fasta_paths=multimer.fasta \
    --output_dir=./ \
    --uniprot_database_path=/data/uniprot/uniprot_trembl.fasta \
    --pdb_seqres_database_path=/data/pdb_seqres/pdb_seqres.txt \
    --pdb70_database_path= \
    --max_template_date=2020-05-14 \
    --model_preset=multimer \
    --run_relax=true \
    --use_gpu_relax=true
```
Alternatively, run `./run_multimer.sh`.
#### Multimer inference parameters
`multimer.fasta` is the multimer sequence to fold and `/data` is the dataset download path; the remaining parameters are the same as for monomer inference. Note that `--pdb70_database_path=` is deliberately left empty: the multimer preset uses the `pdb_seqres` and `uniprot` databases instead of PDB70.
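For reference, a multimer FASTA contains one record per chain; the identifiers and sequences below are placeholders only:
```
>chain_A_placeholder
MNIFEMLRIDEGLRLKIYKDTEGYYTIGIGHLLTKSPSLNAAKSELDKAIGRNTNG
>chain_B_placeholder
MKTAYIAKQRQISFVKSHFSRQLEERLGLIEVQAPILSRVGDGTQDNLSGAEKAVQ
```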
## Results
The `--output_dir` directory is laid out as follows:
```
<target_name>/
features.pkl
ranked_{0,1,2,3,4}.pdb
ranking_debug.json
relaxed_model_{1,2,3,4,5}.pdb
result_model_{1,2,3,4,5}.pkl
timings.json
unrelaxed_model_{1,2,3,4,5}.pdb
msas/
bfd_uniclust_hits.a3m
mgnify_hits.sto
uniref90_hits.sto
...
```
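`ranking_debug.json` stores each model's confidence ('iptm+ptm' for multimer runs, 'plddts' otherwise) together with the rank order; a small sketch for reading it (the target path is a placeholder):
```
# Print the models in rank order with their confidences (illustrative).
import json
import os

target_dir = '/path/to/output_dir/<target_name>'  # placeholder
with open(os.path.join(target_dir, 'ranking_debug.json')) as f:
    ranking = json.load(f)

label = 'iptm+ptm' if 'iptm+ptm' in ranking else 'plddts'
for rank, model_name in enumerate(ranking['order']):
    print(rank, model_name, ranking[label][model_name])
```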
View the protein 3D structure at [https://www.pdbus.org/3d-view](https://www.pdbus.org/3d-view):
![img](./docs/result_pdb.png)
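Because `run_alphafold.py` writes the per-residue pLDDT into the b-factor column of each output PDB, confidence can also be recovered directly from a ranked structure; a sketch assuming Biopython is available (it is not part of the image spec above):
```
# Recover per-residue pLDDT from the b-factor column of a ranked PDB (sketch).
from Bio.PDB import PDBParser
import numpy as np

structure = PDBParser(QUIET=True).get_structure('ranked_0', 'ranked_0.pdb')
plddt_per_residue = [
    # All atoms of a residue carry the same pLDDT, so read the first atom.
    next(iter(residue)).get_bfactor()
    for residue in structure.get_residues()
]
print('mean pLDDT:', np.mean(plddt_per_residue))
```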
## Accuracy
Test data: [CASP14](https://www.predictioncenter.org/casp14/targetlist.cgi), [UniProt](https://www.uniprot.org/)
Accelerator: 1x DCU (Gen 1, 16 GB)
1. Compute the pLDDT:
```
python3 pkl2plddt.py
```
where `data_path` in the script is the path to the `.pkl` file produced by inference.
2. Other accuracy metrics can be computed with [https://zhanggroup.org/TM-score/](https://zhanggroup.org/TM-score/).
Accuracy results:
| Precision | Sequence type | Target | Length | GDT-TS | GDT-HA | LDDT | TM-score | MaxSub | RMSD (Å) |
| :------: | :------: | :------: | :------: | :------: | :------: | :------: | :------: | :------: | :------: |
| fp32 | Monomer | T1026 | 172 | 0.849 | 0.658 | 75.050 | 0.901 | 0.851 | 1.6 |
| fp32 | Monomer | T1053 | 580 | 0.941 | 0.789 | 92.316 | 0.985 | 0.935 | 1.1 |
| fp32 | Monomer | T1091 | 863 | 0.492 | 0.332 | 85.083 | 0.740 | 0.388 | 6.7 |
## Application Scenarios
### Algorithm Category
NLP
### Key Industries
Healthcare, scientific research, education
## Source Repository and Issue Reporting
* [https://developer.hpccube.com/codes/modelzoo/alphafold2_jax](https://developer.hpccube.com/codes/modelzoo/alphafold2_jax)
## References
* [https://github.com/deepmind/alphafold](https://github.com/deepmind/alphafold)
import pickle

import numpy as np

# Path to a result pickle produced by inference.
data_path = r'output/monomer/result_model_1.pkl'
with open(data_path, 'rb') as f:
    datas = pickle.load(f)

# Append the mean pLDDT of the prediction to a log file.
with open('output/T1024.txt', mode='a+', encoding='utf-8') as log:
    print(np.mean(datas['plddt']), file=log)
# Copyright 2021 DeepMind Technologies Limited
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
"""Full AlphaFold protein structure prediction script."""
import json
import os
import pathlib
import pickle
import random
import shutil
import sys
import time
from typing import Dict, Union, Optional
from absl import app
from absl import flags
from absl import logging
from alphafold.common import protein
from alphafold.common import residue_constants
from alphafold.data import pipeline
from alphafold.data import pipeline_multimer
from alphafold.data import templates
from alphafold.data.tools import hhsearch
from alphafold.data.tools import hmmsearch
from alphafold.model import config
from alphafold.model import data
from alphafold.model import model
from alphafold.relax import relax
import numpy as np
# Internal import (7716).
#### USER CONFIGURATION ####
# Set to the target of scripts/download_all_data.sh
DOWNLOAD_DIR = '/alphafold_data_set'
# Path to a directory that will store the results.
output_dir = '~/af2_out'
# Names of models to use.
model_names = [
'model_1',
'model_2',
'model_3',
'model_4',
'model_5',
]
# You can individually override the following paths if you have placed the
# data in locations other than the DOWNLOAD_DIR.
# Path to directory of supporting data, contains 'params' dir.
data_dir = DOWNLOAD_DIR
# Path to the Uniref90 database for use by JackHMMER.
uniref90_database_path = os.path.join(
DOWNLOAD_DIR, 'uniref90', 'uniref90.fasta')
# Path to the MGnify database for use by JackHMMER.
mgnify_database_path = os.path.join(
DOWNLOAD_DIR, 'mgnify', 'mgy_clusters_2018_12.fa')
# Path to the BFD database for use by HHblits.
bfd_database_path = os.path.join(
DOWNLOAD_DIR, 'bfd',
'bfd_metaclust_clu_complete_id30_c90_final_seq.sorted_opt')
# Path to the Uniclust30 database for use by HHblits.
uniclust30_database_path = os.path.join(
DOWNLOAD_DIR, 'uniclust30', 'uniclust30_2018_08', 'uniclust30_2018_08')
#DOWNLOAD_DIR, 'uniclust30', 'UniRef30_2020_02')
# Path to the PDB70 database for use by HHsearch.
pdb70_database_path = os.path.join(DOWNLOAD_DIR, 'pdb70', 'pdb70')
# Path to a directory with template mmCIF structures, each named <pdb_id>.cif
template_mmcif_dir = os.path.join(DOWNLOAD_DIR, 'pdb_mmcif', 'mmcif_files')
# Path to a file mapping obsolete PDB IDs to their replacements.
obsolete_pdbs_path = os.path.join(DOWNLOAD_DIR, 'pdb_mmcif', 'obsolete.dat')
#### END OF USER CONFIGURATION ####
flags.DEFINE_list('fasta_paths', None, 'Paths to FASTA files, each containing '
'one sequence. Paths should be separated by commas. '
'All FASTA paths must have a unique basename as the '
'basename is used to name the output directories for '
'each prediction.')
flags.DEFINE_string('output_dir', output_dir, 'Path to a directory that will '
'store the results.')
flags.DEFINE_list('model_names', model_names, 'Names of models to use.')
flags.DEFINE_string('data_dir', data_dir, 'Path to directory of supporting data.')
flags.DEFINE_string('jackhmmer_binary_path', 'jackhmmer',
'Path to the JackHMMER executable.')
flags.DEFINE_string('hhblits_binary_path', 'hhblits',
'Path to the HHblits executable.')
flags.DEFINE_string('hhsearch_binary_path', 'hhsearch',
'Path to the HHsearch executable.')
flags.DEFINE_string('hmmsearch_binary_path', shutil.which('hmmsearch'),
'Path to the hmmsearch executable.')
flags.DEFINE_string('hmmbuild_binary_path', shutil.which('hmmbuild'),
'Path to the hmmbuild executable.')
flags.DEFINE_string('kalign_binary_path', 'kalign',
'Path to the Kalign executable.')
flags.DEFINE_string('uniref90_database_path', uniref90_database_path, 'Path to the Uniref90 '
'database for use by JackHMMER.')
flags.DEFINE_string('mgnify_database_path', mgnify_database_path, 'Path to the MGnify '
'database for use by JackHMMER.')
flags.DEFINE_string('bfd_database_path', bfd_database_path, 'Path to the BFD '
'database for use by HHblits.')
flags.DEFINE_string('uniclust30_database_path', uniclust30_database_path, 'Path to the Uniclust30 '
'database for use by HHblits.')
flags.DEFINE_string('pdb70_database_path', pdb70_database_path, 'Path to the PDB70 '
'database for use by HHsearch.')
flags.DEFINE_string('template_mmcif_dir', template_mmcif_dir, 'Path to a directory with '
'template mmCIF structures, each named <pdb_id>.cif')
flags.DEFINE_string('small_bfd_database_path', None, 'Path to the small '
'version of BFD used with the "reduced_dbs" preset.')
flags.DEFINE_string('uniprot_database_path', None, 'Path to the Uniprot '
'database for use by JackHMMer.')
flags.DEFINE_string('pdb_seqres_database_path', None, 'Path to the PDB '
'seqres database for use by hmmsearch.')
flags.DEFINE_string('max_template_date', '2020-05-14', 'Maximum template release date '
'to consider. Important if folding historical test sets.')
flags.DEFINE_string('obsolete_pdbs_path', obsolete_pdbs_path, 'Path to file containing a '
'mapping from obsolete PDB IDs to the PDB IDs of their '
'replacements.')
flags.DEFINE_enum('db_preset', 'full_dbs',
['full_dbs', 'reduced_dbs'],
'Choose preset MSA database configuration - '
'smaller genetic database config (reduced_dbs) or '
'full genetic database config (full_dbs)')
flags.DEFINE_enum('model_preset', 'monomer',
['monomer', 'monomer_casp14', 'monomer_ptm', 'multimer'],
'Choose preset model configuration - the monomer model, '
'the monomer model with extra ensembling, monomer model with '
'pTM head, or multimer model')
flags.DEFINE_boolean('benchmark', False, 'Run multiple JAX model evaluations '
'to obtain a timing that excludes the compilation time, '
'which should be more indicative of the time required for '
'inferencing many proteins.')
flags.DEFINE_integer('random_seed', None, 'The random seed for the data '
'pipeline. By default, this is randomly generated. Note '
'that even if this is set, Alphafold may still not be '
'deterministic, because processes like GPU inference are '
'nondeterministic.')
flags.DEFINE_integer('num_multimer_predictions_per_model', 5, 'How many '
'predictions (each with a different random seed) will be '
'generated per model. E.g. if this is 2 and there are 5 '
'models then there will be 10 predictions per input. '
'Note: this FLAG only applies if model_preset=multimer')
flags.DEFINE_boolean('use_precomputed_msas', False, 'Whether to read MSAs that '
'have been written to disk instead of running the MSA '
'tools. The MSA files are looked up in the output '
'directory, so it must stay the same between multiple '
'runs that are to reuse the MSAs. WARNING: This will not '
'check if the sequence, database or configuration have '
'changed.')
flags.DEFINE_boolean('run_relax', True, 'Whether to run the final relaxation '
'step on the predicted models. Turning relax off might '
'result in predictions with distracting stereochemical '
'violations but might help in case you are having issues '
'with the relaxation stage.')
flags.DEFINE_boolean('use_gpu_relax', None, 'Whether to relax on GPU. '
'Relax on GPU can be much faster than CPU, so it is '
'recommended to enable if possible. GPUs must be available'
' if this setting is enabled.')
FLAGS = flags.FLAGS
MAX_TEMPLATE_HITS = 20
RELAX_MAX_ITERATIONS = 0
RELAX_ENERGY_TOLERANCE = 2.39
RELAX_STIFFNESS = 10.0
RELAX_EXCLUDE_RESIDUES = []
RELAX_MAX_OUTER_ITERATIONS = 3
def _check_flag(flag_name: str,
other_flag_name: str,
should_be_set: bool):
if should_be_set != bool(FLAGS[flag_name].value):
verb = 'be' if should_be_set else 'not be'
raise ValueError(f'{flag_name} must {verb} set when running with '
f'"--{other_flag_name}={FLAGS[other_flag_name].value}".')
def predict_structure(
fasta_path: str,
fasta_name: str,
output_dir_base: str,
data_pipeline: Union[pipeline.DataPipeline, pipeline_multimer.DataPipeline],
model_runners: Dict[str, model.RunModel],
amber_relaxer: relax.AmberRelaxation,
benchmark: bool,
random_seed: int):
"""Predicts structure using AlphaFold for the given sequence."""
logging.info('Predicting %s', fasta_name)
timings = {}
output_dir = os.path.join(output_dir_base, fasta_name)
if not os.path.exists(output_dir):
os.makedirs(output_dir)
msa_output_dir = os.path.join(output_dir, 'msas')
if not os.path.exists(msa_output_dir):
os.makedirs(msa_output_dir)
features_output_path = os.path.join(output_dir, 'features.pkl')
  # Get features. If a features.pkl file already exists, load it and skip the
  # MSA and template search; otherwise run the full data pipeline.
  t_0 = time.time()
  if os.path.exists(features_output_path):
feature_dict = pickle.load(open(features_output_path, 'rb'))
else:
feature_dict = data_pipeline.process(
input_fasta_path=fasta_path,
msa_output_dir=msa_output_dir)
timings['features'] = time.time() - t_0
# Write out features as a pickled dictionary.
with open(features_output_path, 'wb') as f:
pickle.dump(feature_dict, f, protocol=4)
unrelaxed_pdbs = {}
relaxed_pdbs = {}
ranking_confidences = {}
# Run the models.
num_models = len(model_runners)
for model_index, (model_name, model_runner) in enumerate(
model_runners.items()):
logging.info('Running model %s on %s', model_name, fasta_name)
t_0 = time.time()
model_random_seed = model_index + random_seed * num_models
processed_feature_dict = model_runner.process_features(
feature_dict, random_seed=model_random_seed)
timings[f'process_features_{model_name}'] = time.time() - t_0
t_0 = time.time()
prediction_result = model_runner.predict(processed_feature_dict,
random_seed=model_random_seed)
t_diff = time.time() - t_0
timings[f'predict_and_compile_{model_name}'] = t_diff
logging.info(
'Total JAX model %s on %s predict time (includes compilation time, see --benchmark): %.1fs',
model_name, fasta_name, t_diff)
if benchmark:
t_0 = time.time()
model_runner.predict(processed_feature_dict,
random_seed=model_random_seed)
t_diff = time.time() - t_0
timings[f'predict_benchmark_{model_name}'] = t_diff
logging.info(
'Total JAX model %s on %s predict time (excludes compilation time): %.1fs',
model_name, fasta_name, t_diff)
plddt = prediction_result['plddt']
ranking_confidences[model_name] = prediction_result['ranking_confidence']
# Save the model outputs.
result_output_path = os.path.join(output_dir, f'result_{model_name}.pkl')
with open(result_output_path, 'wb') as f:
pickle.dump(prediction_result, f, protocol=4)
# Add the predicted LDDT in the b-factor column.
# Note that higher predicted LDDT value means higher model confidence.
plddt_b_factors = np.repeat(
plddt[:, None], residue_constants.atom_type_num, axis=-1)
unrelaxed_protein = protein.from_prediction(
features=processed_feature_dict,
result=prediction_result,
b_factors=plddt_b_factors,
remove_leading_feature_dimension=not model_runner.multimer_mode)
unrelaxed_pdbs[model_name] = protein.to_pdb(unrelaxed_protein)
unrelaxed_pdb_path = os.path.join(output_dir, f'unrelaxed_{model_name}.pdb')
with open(unrelaxed_pdb_path, 'w') as f:
f.write(unrelaxed_pdbs[model_name])
if amber_relaxer:
# Relax the prediction.
t_0 = time.time()
relaxed_pdb_str, _, _ = amber_relaxer.process(prot=unrelaxed_protein)
timings[f'relax_{model_name}'] = time.time() - t_0
relaxed_pdbs[model_name] = relaxed_pdb_str
# Save the relaxed PDB.
relaxed_output_path = os.path.join(
output_dir, f'relaxed_{model_name}.pdb')
with open(relaxed_output_path, 'w') as f:
f.write(relaxed_pdb_str)
# Rank by model confidence and write out relaxed PDBs in rank order.
ranked_order = []
for idx, (model_name, _) in enumerate(
sorted(ranking_confidences.items(), key=lambda x: x[1], reverse=True)):
ranked_order.append(model_name)
ranked_output_path = os.path.join(output_dir, f'ranked_{idx}.pdb')
with open(ranked_output_path, 'w') as f:
if amber_relaxer:
f.write(relaxed_pdbs[model_name])
else:
f.write(unrelaxed_pdbs[model_name])
ranking_output_path = os.path.join(output_dir, 'ranking_debug.json')
with open(ranking_output_path, 'w') as f:
label = 'iptm+ptm' if 'iptm' in prediction_result else 'plddts'
f.write(json.dumps(
{label: ranking_confidences, 'order': ranked_order}, indent=4))
logging.info('Final timings for %s: %s', fasta_name, timings)
timings_output_path = os.path.join(output_dir, 'timings.json')
with open(timings_output_path, 'w') as f:
f.write(json.dumps(timings, indent=4))
def main(argv):
if len(argv) > 1:
raise app.UsageError('Too many command-line arguments.')
for tool_name in (
'jackhmmer', 'hhblits', 'hhsearch', 'hmmsearch', 'hmmbuild', 'kalign'):
if not FLAGS[f'{tool_name}_binary_path'].value:
raise ValueError(f'Could not find path to the "{tool_name}" binary. Make '
'sure it is installed on your system.')
use_small_bfd = FLAGS.db_preset == 'reduced_dbs'
_check_flag('small_bfd_database_path', 'db_preset',
should_be_set=use_small_bfd)
_check_flag('bfd_database_path', 'db_preset',
should_be_set=not use_small_bfd)
_check_flag('uniclust30_database_path', 'db_preset',
should_be_set=not use_small_bfd)
run_multimer_system = 'multimer' in FLAGS.model_preset
_check_flag('pdb70_database_path', 'model_preset',
should_be_set=not run_multimer_system)
_check_flag('pdb_seqres_database_path', 'model_preset',
should_be_set=run_multimer_system)
_check_flag('uniprot_database_path', 'model_preset',
should_be_set=run_multimer_system)
if FLAGS.model_preset == 'monomer_casp14':
num_ensemble = 8
else:
num_ensemble = 1
# Check for duplicate FASTA file names.
fasta_names = [pathlib.Path(p).stem for p in FLAGS.fasta_paths]
if len(fasta_names) != len(set(fasta_names)):
raise ValueError('All FASTA paths must have a unique basename.')
if run_multimer_system:
template_searcher = hmmsearch.Hmmsearch(
binary_path=FLAGS.hmmsearch_binary_path,
hmmbuild_binary_path=FLAGS.hmmbuild_binary_path,
database_path=FLAGS.pdb_seqres_database_path)
template_featurizer = templates.HmmsearchHitFeaturizer(
mmcif_dir=FLAGS.template_mmcif_dir,
max_template_date=FLAGS.max_template_date,
max_hits=MAX_TEMPLATE_HITS,
kalign_binary_path=FLAGS.kalign_binary_path,
release_dates_path=None,
obsolete_pdbs_path=FLAGS.obsolete_pdbs_path)
else:
template_searcher = hhsearch.HHSearch(
binary_path=FLAGS.hhsearch_binary_path,
databases=[FLAGS.pdb70_database_path])
template_featurizer = templates.HhsearchHitFeaturizer(
mmcif_dir=FLAGS.template_mmcif_dir,
max_template_date=FLAGS.max_template_date,
max_hits=MAX_TEMPLATE_HITS,
kalign_binary_path=FLAGS.kalign_binary_path,
release_dates_path=None,
obsolete_pdbs_path=FLAGS.obsolete_pdbs_path)
monomer_data_pipeline = pipeline.DataPipeline(
jackhmmer_binary_path=FLAGS.jackhmmer_binary_path,
hhblits_binary_path=FLAGS.hhblits_binary_path,
uniref90_database_path=FLAGS.uniref90_database_path,
mgnify_database_path=FLAGS.mgnify_database_path,
bfd_database_path=FLAGS.bfd_database_path,
uniclust30_database_path=FLAGS.uniclust30_database_path,
small_bfd_database_path=FLAGS.small_bfd_database_path,
template_searcher=template_searcher,
template_featurizer=template_featurizer,
use_small_bfd=use_small_bfd,
use_precomputed_msas=FLAGS.use_precomputed_msas)
if run_multimer_system:
num_predictions_per_model = FLAGS.num_multimer_predictions_per_model
data_pipeline = pipeline_multimer.DataPipeline(
monomer_data_pipeline=monomer_data_pipeline,
jackhmmer_binary_path=FLAGS.jackhmmer_binary_path,
uniprot_database_path=FLAGS.uniprot_database_path,
use_precomputed_msas=FLAGS.use_precomputed_msas)
else:
num_predictions_per_model = 1
data_pipeline = monomer_data_pipeline
model_runners = {}
model_names = config.MODEL_PRESETS[FLAGS.model_preset]
for model_name in model_names:
model_config = config.model_config(model_name)
if run_multimer_system:
model_config.model.num_ensemble_eval = num_ensemble
else:
model_config.data.eval.num_ensemble = num_ensemble
model_params = data.get_model_haiku_params(
model_name=model_name, data_dir=FLAGS.data_dir)
model_runner = model.RunModel(model_config, model_params)
for i in range(num_predictions_per_model):
model_runners[f'{model_name}_pred_{i}'] = model_runner
logging.info('Have %d models: %s', len(model_runners),
list(model_runners.keys()))
if FLAGS.run_relax:
amber_relaxer = relax.AmberRelaxation(
max_iterations=RELAX_MAX_ITERATIONS,
tolerance=RELAX_ENERGY_TOLERANCE,
stiffness=RELAX_STIFFNESS,
exclude_residues=RELAX_EXCLUDE_RESIDUES,
max_outer_iterations=RELAX_MAX_OUTER_ITERATIONS,
use_gpu=FLAGS.use_gpu_relax)
else:
amber_relaxer = None
random_seed = FLAGS.random_seed
if random_seed is None:
random_seed = random.randrange(sys.maxsize // len(model_runners))
logging.info('Using random seed %d for the data pipeline', random_seed)
# Predict structure for each of the sequences.
for i, fasta_path in enumerate(FLAGS.fasta_paths):
fasta_name = fasta_names[i]
predict_structure(
fasta_path=fasta_path,
fasta_name=fasta_name,
output_dir_base=FLAGS.output_dir,
data_pipeline=data_pipeline,
model_runners=model_runners,
amber_relaxer=amber_relaxer,
benchmark=FLAGS.benchmark,
random_seed=random_seed)
if __name__ == '__main__':
flags.mark_flags_as_required([
'fasta_paths',
'output_dir',
'data_dir',
'uniref90_database_path',
'mgnify_database_path',
'template_mmcif_dir',
'max_template_date',
'obsolete_pdbs_path',
'use_gpu_relax',
])
app.run(main)
python3 run_alphafold.py \
--fasta_paths=monomer.fasta \
--output_dir=./ \
--max_template_date=2020-05-14 \
--model_preset=monomer \
--run_relax=true \
--use_gpu_relax=true
python3 run_alphafold.py \
--fasta_paths=multimer.fasta \
--output_dir=./ \
--uniprot_database_path=/data/uniprot/uniprot_trembl.fasta \
--pdb_seqres_database_path=/data/pdb_seqres/pdb_seqres.txt \
--pdb70_database_path= \
--max_template_date=2020-05-14 \
--model_preset=multimer \
--run_relax=true \
--use_gpu_relax=true
#!/bin/bash
#
# Copyright 2021 DeepMind Technologies Limited
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
#
# Downloads and unzips all required data for AlphaFold.
#
# Usage: bash download_all_data.sh /path/to/download/directory
set -e
if [[ $# -eq 0 ]]; then
echo "Error: download directory must be provided as an input argument."
exit 1
fi
if ! command -v aria2c &> /dev/null ; then
echo "Error: aria2c could not be found. Please install aria2c (sudo apt install aria2)."
exit 1
fi
DOWNLOAD_DIR="$1"
DOWNLOAD_MODE="${2:-full_dbs}" # Default mode to full_dbs.
if [[ "${DOWNLOAD_MODE}" != full_dbs && "${DOWNLOAD_MODE}" != reduced_dbs ]]
then
echo "DOWNLOAD_MODE ${DOWNLOAD_MODE} not recognized."
exit 1
fi
SCRIPT_DIR="$(dirname "$(realpath "$0")")"
echo "Downloading AlphaFold parameters..."
bash "${SCRIPT_DIR}/download_alphafold_params.sh" "${DOWNLOAD_DIR}"
if [[ "${DOWNLOAD_MODE}" = reduced_dbs ]] ; then
echo "Downloading Small BFD..."
bash "${SCRIPT_DIR}/download_small_bfd.sh" "${DOWNLOAD_DIR}"
else
echo "Downloading BFD..."
bash "${SCRIPT_DIR}/download_bfd.sh" "${DOWNLOAD_DIR}"
fi
echo "Downloading MGnify..."
bash "${SCRIPT_DIR}/download_mgnify.sh" "${DOWNLOAD_DIR}"
echo "Downloading PDB70..."
bash "${SCRIPT_DIR}/download_pdb70.sh" "${DOWNLOAD_DIR}"
echo "Downloading PDB mmCIF files..."
bash "${SCRIPT_DIR}/download_pdb_mmcif.sh" "${DOWNLOAD_DIR}"
echo "Downloading Uniclust30..."
bash "${SCRIPT_DIR}/download_uniclust30.sh" "${DOWNLOAD_DIR}"
echo "Downloading Uniref90..."
bash "${SCRIPT_DIR}/download_uniref90.sh" "${DOWNLOAD_DIR}"
echo "Downloading UniProt..."
bash "${SCRIPT_DIR}/download_uniprot.sh" "${DOWNLOAD_DIR}"
echo "Downloading PDB SeqRes..."
bash "${SCRIPT_DIR}/download_pdb_seqres.sh" "${DOWNLOAD_DIR}"
echo "All data downloaded."
#!/bin/bash
#
# Copyright 2021 DeepMind Technologies Limited
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
#
# Downloads and unzips the AlphaFold parameters.
#
# Usage: bash download_alphafold_params.sh /path/to/download/directory
set -e
if [[ $# -eq 0 ]]; then
echo "Error: download directory must be provided as an input argument."
exit 1
fi
if ! command -v aria2c &> /dev/null ; then
echo "Error: aria2c could not be found. Please install aria2c (sudo apt install aria2)."
exit 1
fi
DOWNLOAD_DIR="$1"
ROOT_DIR="${DOWNLOAD_DIR}/params"
SOURCE_URL="https://storage.googleapis.com/alphafold/alphafold_params_2022-03-02.tar"
BASENAME=$(basename "${SOURCE_URL}")
mkdir --parents "${ROOT_DIR}"
aria2c "${SOURCE_URL}" --dir="${ROOT_DIR}"
tar --extract --verbose --file="${ROOT_DIR}/${BASENAME}" \
--directory="${ROOT_DIR}" --preserve-permissions
rm "${ROOT_DIR}/${BASENAME}"
#!/bin/bash
#
# Copyright 2021 DeepMind Technologies Limited
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
#
# Downloads and unzips the BFD database for AlphaFold.
#
# Usage: bash download_bfd.sh /path/to/download/directory
set -e
if [[ $# -eq 0 ]]; then
echo "Error: download directory must be provided as an input argument."
exit 1
fi
if ! command -v aria2c &> /dev/null ; then
echo "Error: aria2c could not be found. Please install aria2c (sudo apt install aria2)."
exit 1
fi
DOWNLOAD_DIR="$1"
ROOT_DIR="${DOWNLOAD_DIR}/bfd"
# Mirror of:
# https://bfd.mmseqs.com/bfd_metaclust_clu_complete_id30_c90_final_seq.sorted_opt.tar.gz.
SOURCE_URL="https://storage.googleapis.com/alphafold-databases/casp14_versions/bfd_metaclust_clu_complete_id30_c90_final_seq.sorted_opt.tar.gz"
BASENAME=$(basename "${SOURCE_URL}")
mkdir --parents "${ROOT_DIR}"
aria2c "${SOURCE_URL}" --dir="${ROOT_DIR}"
tar --extract --verbose --file="${ROOT_DIR}/${BASENAME}" \
--directory="${ROOT_DIR}"
rm "${ROOT_DIR}/${BASENAME}"
#!/bin/bash
#
# Copyright 2021 DeepMind Technologies Limited
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
#
# Downloads and unzips the MGnify database for AlphaFold.
#
# Usage: bash download_mgnify.sh /path/to/download/directory
set -e
if [[ $# -eq 0 ]]; then
echo "Error: download directory must be provided as an input argument."
exit 1
fi
if ! command -v aria2c &> /dev/null ; then
echo "Error: aria2c could not be found. Please install aria2c (sudo apt install aria2)."
exit 1
fi
DOWNLOAD_DIR="$1"
ROOT_DIR="${DOWNLOAD_DIR}/mgnify"
# Mirror of:
# ftp://ftp.ebi.ac.uk/pub/databases/metagenomics/peptide_database/2018_12/mgy_clusters.fa.gz
SOURCE_URL="https://storage.googleapis.com/alphafold-databases/casp14_versions/mgy_clusters_2018_12.fa.gz"
BASENAME=$(basename "${SOURCE_URL}")
mkdir --parents "${ROOT_DIR}"
aria2c "${SOURCE_URL}" --dir="${ROOT_DIR}"
pushd "${ROOT_DIR}"
gunzip "${ROOT_DIR}/${BASENAME}"
popd
#!/bin/bash
#
# Copyright 2021 DeepMind Technologies Limited
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
#
# Downloads and unzips the PDB70 database for AlphaFold.
#
# Usage: bash download_pdb70.sh /path/to/download/directory
set -e
if [[ $# -eq 0 ]]; then
echo "Error: download directory must be provided as an input argument."
exit 1
fi
if ! command -v aria2c &> /dev/null ; then
echo "Error: aria2c could not be found. Please install aria2c (sudo apt install aria2)."
exit 1
fi
DOWNLOAD_DIR="$1"
ROOT_DIR="${DOWNLOAD_DIR}/pdb70"
SOURCE_URL="http://wwwuser.gwdg.de/~compbiol/data/hhsuite/databases/hhsuite_dbs/old-releases/pdb70_from_mmcif_200401.tar.gz"
BASENAME=$(basename "${SOURCE_URL}")
mkdir --parents "${ROOT_DIR}"
aria2c "${SOURCE_URL}" --dir="${ROOT_DIR}"
tar --extract --verbose --file="${ROOT_DIR}/${BASENAME}" \
--directory="${ROOT_DIR}"
rm "${ROOT_DIR}/${BASENAME}"
#!/bin/bash
#
# Copyright 2021 DeepMind Technologies Limited
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
#
# Downloads, unzips and flattens the PDB database for AlphaFold.
#
# Usage: bash download_pdb_mmcif.sh /path/to/download/directory
set -e
if [[ $# -eq 0 ]]; then
echo "Error: download directory must be provided as an input argument."
exit 1
fi
if ! command -v aria2c &> /dev/null ; then
echo "Error: aria2c could not be found. Please install aria2c (sudo apt install aria2)."
exit 1
fi
if ! command -v rsync &> /dev/null ; then
echo "Error: rsync could not be found. Please install rsync."
exit 1
fi
DOWNLOAD_DIR="$1"
ROOT_DIR="${DOWNLOAD_DIR}/pdb_mmcif"
RAW_DIR="${ROOT_DIR}/raw"
MMCIF_DIR="${ROOT_DIR}/mmcif_files"
echo "Running rsync to fetch all mmCIF files (note that the rsync progress estimate might be inaccurate)..."
echo "If the download speed is too slow, try changing the mirror to:"
echo " * rsync.ebi.ac.uk::pub/databases/pdb/data/structures/divided/mmCIF/ (Europe)"
echo " * ftp.pdbj.org::ftp_data/structures/divided/mmCIF/ (Asia)"
echo "or see https://www.wwpdb.org/ftp/pdb-ftp-sites for more download options."
mkdir --parents "${RAW_DIR}"
rsync --recursive --links --perms --times --compress --info=progress2 --delete --port=33444 \
rsync.rcsb.org::ftp_data/structures/divided/mmCIF/ \
"${RAW_DIR}"
# rsync --recursive --links --perms --times --compress --info=progress2 --delete --port=33444 \
# data.pdbj.org::ftp_data/structures/divided/mmCIF/ \
# "${RAW_DIR}"
echo "Unzipping all mmCIF files..."
find "${RAW_DIR}/" -type f -iname "*.gz" -exec gunzip {} +
echo "Flattening all mmCIF files..."
mkdir --parents "${MMCIF_DIR}"
find "${RAW_DIR}" -type d -empty -delete # Delete empty directories.
for subdir in "${RAW_DIR}"/*; do
mv "${subdir}/"*.cif "${MMCIF_DIR}"
done
# Delete empty download directory structure.
find "${RAW_DIR}" -type d -empty -delete
aria2c "ftp://ftp.wwpdb.org/pub/pdb/data/status/obsolete.dat" --dir="${ROOT_DIR}"
#!/bin/bash
#
# Copyright 2021 DeepMind Technologies Limited
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
#
# Downloads and unzips the PDB SeqRes database for AlphaFold.
#
# Usage: bash download_pdb_seqres.sh /path/to/download/directory
set -e
if [[ $# -eq 0 ]]; then
echo "Error: download directory must be provided as an input argument."
exit 1
fi
if ! command -v aria2c &> /dev/null ; then
echo "Error: aria2c could not be found. Please install aria2c (sudo apt install aria2)."
exit 1
fi
DOWNLOAD_DIR="$1"
ROOT_DIR="${DOWNLOAD_DIR}/pdb_seqres"
SOURCE_URL="ftp://ftp.wwpdb.org/pub/pdb/derived_data/pdb_seqres.txt"
BASENAME=$(basename "${SOURCE_URL}")
mkdir --parents "${ROOT_DIR}"
aria2c "${SOURCE_URL}" --dir="${ROOT_DIR}"
#!/bin/bash
#
# Copyright 2021 DeepMind Technologies Limited
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
#
# Downloads and unzips the Small BFD database for AlphaFold.
#
# Usage: bash download_small_bfd.sh /path/to/download/directory
set -e
if [[ $# -eq 0 ]]; then
echo "Error: download directory must be provided as an input argument."
exit 1
fi
if ! command -v aria2c &> /dev/null ; then
echo "Error: aria2c could not be found. Please install aria2c (sudo apt install aria2)."
exit 1
fi
DOWNLOAD_DIR="$1"
ROOT_DIR="${DOWNLOAD_DIR}/small_bfd"
SOURCE_URL="https://storage.googleapis.com/alphafold-databases/reduced_dbs/bfd-first_non_consensus_sequences.fasta.gz"
BASENAME=$(basename "${SOURCE_URL}")
mkdir --parents "${ROOT_DIR}"
aria2c "${SOURCE_URL}" --dir="${ROOT_DIR}"
pushd "${ROOT_DIR}"
gunzip "${ROOT_DIR}/${BASENAME}"
popd
#!/bin/bash
#
# Copyright 2021 DeepMind Technologies Limited
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
#
# Downloads and unzips the Uniclust30 database for AlphaFold.
#
# Usage: bash download_uniclust30.sh /path/to/download/directory
set -e
if [[ $# -eq 0 ]]; then
echo "Error: download directory must be provided as an input argument."
exit 1
fi
if ! command -v aria2c &> /dev/null ; then
echo "Error: aria2c could not be found. Please install aria2c (sudo apt install aria2)."
exit 1
fi
DOWNLOAD_DIR="$1"
ROOT_DIR="${DOWNLOAD_DIR}/uniclust30"
# Mirror of:
# http://wwwuser.gwdg.de/~compbiol/uniclust/2018_08/uniclust30_2018_08_hhsuite.tar.gz
SOURCE_URL="https://storage.googleapis.com/alphafold-databases/casp14_versions/uniclust30_2018_08_hhsuite.tar.gz"
BASENAME=$(basename "${SOURCE_URL}")
mkdir --parents "${ROOT_DIR}"
aria2c "${SOURCE_URL}" --dir="${ROOT_DIR}"
tar --extract --verbose --file="${ROOT_DIR}/${BASENAME}" \
--directory="${ROOT_DIR}"
rm "${ROOT_DIR}/${BASENAME}"
#!/bin/bash
#
# Copyright 2021 DeepMind Technologies Limited
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
#
# Downloads, unzips and merges the SwissProt and TrEMBL databases for
# AlphaFold-Multimer.
#
# Usage: bash download_uniprot.sh /path/to/download/directory
set -e
if [[ $# -eq 0 ]]; then
echo "Error: download directory must be provided as an input argument."
exit 1
fi
if ! command -v aria2c &> /dev/null ; then
echo "Error: aria2c could not be found. Please install aria2c (sudo apt install aria2)."
exit 1
fi
DOWNLOAD_DIR="$1"
ROOT_DIR="${DOWNLOAD_DIR}/uniprot"
TREMBL_SOURCE_URL="ftp://ftp.ebi.ac.uk/pub/databases/uniprot/current_release/knowledgebase/complete/uniprot_trembl.fasta.gz"
TREMBL_BASENAME=$(basename "${TREMBL_SOURCE_URL}")
TREMBL_UNZIPPED_BASENAME="${TREMBL_BASENAME%.gz}"
SPROT_SOURCE_URL="ftp://ftp.ebi.ac.uk/pub/databases/uniprot/current_release/knowledgebase/complete/uniprot_sprot.fasta.gz"
SPROT_BASENAME=$(basename "${SPROT_SOURCE_URL}")
SPROT_UNZIPPED_BASENAME="${SPROT_BASENAME%.gz}"
mkdir --parents "${ROOT_DIR}"
aria2c "${TREMBL_SOURCE_URL}" --dir="${ROOT_DIR}"
aria2c "${SPROT_SOURCE_URL}" --dir="${ROOT_DIR}"
pushd "${ROOT_DIR}"
gunzip "${ROOT_DIR}/${TREMBL_BASENAME}"
gunzip "${ROOT_DIR}/${SPROT_BASENAME}"
# Concatenate TrEMBL and SwissProt, rename to uniprot and clean up.
cat "${ROOT_DIR}/${SPROT_UNZIPPED_BASENAME}" >> "${ROOT_DIR}/${TREMBL_UNZIPPED_BASENAME}"
mv "${ROOT_DIR}/${TREMBL_UNZIPPED_BASENAME}" "${ROOT_DIR}/uniprot.fasta"
rm "${ROOT_DIR}/${SPROT_UNZIPPED_BASENAME}"
popd
#!/bin/bash
#
# Copyright 2021 DeepMind Technologies Limited
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
#
# Downloads and unzips the UniRef90 database for AlphaFold.
#
# Usage: bash download_uniref90.sh /path/to/download/directory
set -e
if [[ $# -eq 0 ]]; then
echo "Error: download directory must be provided as an input argument."
exit 1
fi
if ! command -v aria2c &> /dev/null ; then
echo "Error: aria2c could not be found. Please install aria2c (sudo apt install aria2)."
exit 1
fi
DOWNLOAD_DIR="$1"
ROOT_DIR="${DOWNLOAD_DIR}/uniref90"
SOURCE_URL="ftp://ftp.uniprot.org/pub/databases/uniprot/uniref/uniref90/uniref90.fasta.gz"
BASENAME=$(basename "${SOURCE_URL}")
mkdir --parents "${ROOT_DIR}"
aria2c "${SOURCE_URL}" --dir="${ROOT_DIR}"
pushd "${ROOT_DIR}"
gunzip "${ROOT_DIR}/${BASENAME}"
popd