<!--
* @Author: zhuww
* @email: zhuww@sugon.com
* @Date: 2023-04-06 18:04:07
* @LastEditTime: 2023-08-24 09:34:01
-->
# AF2
## Paper
- [https://www.nature.com/articles/s41586-021-03819-2](https://www.nature.com/articles/s41586-021-03819-2)
## Model Architecture
The core of the model is a Transformer-based neural network with two main components, a sequence-to-sequence model and a structure model; the two are optimized jointly through iterative training to improve prediction accuracy.
![img](./docs/alphafold2.png)
## Algorithm
AlphaFold2 predicts a protein's three-dimensional structure with a neural network that extracts information from protein sequence and structure data.
![img](./docs/alphafold2_1.png)
## Environment Setup
An inference Docker image can be pulled from [SourceFind](https://www.sourcefind.cn/#/service-details):
```
docker pull image.sourcefind.cn:5000/dcu/admin/base/custom:alphafold2-2.2.1-centos7.6-dtk-22.04.2-py38
# <Image ID>: ID of the image pulled above
# <Host Path>: path on the host
# <Container Path>: mount path inside the container
docker run -it --name alphafold --shm-size=32G --device=/dev/kfd --device=/dev/dri/ --cap-add=SYS_PTRACE --security-opt seccomp=unconfined --ulimit memlock=-1:-1 --ipc=host --network host --group-add video -v <Host Path>:<Container Path> <Image ID> /bin/bash
```
Image dependencies:
* DTK driver: dtk22.04.2
* JAX: 0.3.14
* TensorFlow2: 2.10.0
* Python: 3.8
Activate the environment inside the image:
`source /opt/dtk-22.04.2/env.sh`
`source /opt/openmm-hip/env.sh`
Test directory:
`/opt/docker/tests/alphafold`
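To confirm the environment inside the container, you can import the stack and list the visible devices; a minimal check, assuming the image versions above:
```
# Quick sanity check of the container environment (illustrative, not part of the repo).
import jax
import tensorflow as tf

print(jax.__version__)  # expected: 0.3.14
print(tf.__version__)   # expected: 2.10.0
# On DCU, accelerators should show up here via the ROCm backend.
print(jax.devices())
```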
## Datasets
We recommend the open datasets used by AlphaFold2, including BFD, MGnify, PDB70, Uniclust30, and UniRef90; the full set is about 2.2 TB. The directory layout is as follows:
```
$DOWNLOAD_DIR/
bfd/
bfd_metaclust_clu_complete_id30_c90_final_seq.sorted_opt_hhm.ffindex
bfd_metaclust_clu_complete_id30_c90_final_seq.sorted_opt_hhm.ffdata
bfd_metaclust_clu_complete_id30_c90_final_seq.sorted_opt_cs219.ffindex
...
mgnify/
mgy_clusters_2018_12.fa
params/
params_model_1.npz
params_model_2.npz
params_model_3.npz
...
pdb70/
pdb_filter.dat
pdb70_hhm.ffindex
pdb70_hhm.ffdata
...
pdb_mmcif/
mmcif_files/
100d.cif
101d.cif
101m.cif
...
obsolete.dat
pdb_seqres/
pdb_seqres.txt
small_bfd/
bfd-first_non_consensus_sequences.fasta
uniclust30/
uniclust30_2018_08/
uniclust30_2018_08_md5sum
uniclust30_2018_08_hhm_db.index
uniclust30_2018_08_hhm_db
...
uniprot/
uniprot.fasta
uniref90/
uniref90.fasta
```
A script, `download_all_data.sh`, is provided for downloading the datasets and model parameter files:
```
./scripts/download_all_data.sh <download directory>
```
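Before running inference, it can help to verify that the key database files are in place; a minimal check based on the layout above (`DOWNLOAD_DIR` is a placeholder):
```
# Verify that the expected database files exist under DOWNLOAD_DIR (illustrative sketch).
import os

DOWNLOAD_DIR = '/path/to/database'
expected = [
    'params/params_model_1.npz',
    'mgnify/mgy_clusters_2018_12.fa',
    'pdb70/pdb70_hhm.ffdata',
    'pdb_mmcif/obsolete.dat',
    'uniref90/uniref90.fasta',
]
for rel_path in expected:
    path = os.path.join(DOWNLOAD_DIR, rel_path)
    print('ok     ' if os.path.exists(path) else 'MISSING', path)
```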
## Inference
JAX-based inference scripts are provided for both monomer and multimer prediction.
Set the `DOWNLOAD_DIR` and `output_dir` paths in `run_alphafold.py`. Make sure the output directory exists and that you have write permission for it.
```
# Set to the target of scripts/download_all_data.sh
DOWNLOAD_DIR = '/path/to/database'
# Path to a directory that will store the results.
output_dir = '/path/to/output_dir'
```
### Monomer
```
python3 run_alphafold.py \
    --fasta_paths=monomer.fasta \
    --output_dir=./ \
    --max_template_date=2020-05-14 \
    --model_preset=monomer \
    --run_relax=true \
    --use_gpu_relax=true
```
Alternatively, run `./run_monomer.sh`.
#### Monomer inference parameters
* `monomer.fasta`: the monomer sequence to fold.
* `--output_dir`: the output directory.
* `--model_preset`: the model configuration to use.
* `--run_relax=true`: run the final relaxation step.
* `--use_gpu_relax=true`: relax on GPU (faster, but may be less stable); `--use_gpu_relax=false` relaxes on CPU (slower, but stable).
* `--use_precomputed_msas=true`: reuse MSAs that have already been computed; by default, the search and alignment are run from scratch.
### Multimer
```
python3 run_alphafold.py \
    --fasta_paths=multimer.fasta \
    --output_dir=./ \
    --uniprot_database_path=/data/uniprot/uniprot_trembl.fasta \
    --pdb_seqres_database_path=/data/pdb_seqres/pdb_seqres.txt \
    --pdb70_database_path= \
    --max_template_date=2020-05-14 \
    --model_preset=multimer \
    --run_relax=true \
    --use_gpu_relax=true
```
Alternatively, run `./run_multimer.sh`.
#### Multimer inference parameters
`multimer.fasta` is the multimer sequence to fold and `/data` is the dataset download path; the remaining parameters are the same as for monomer inference. Note that `--pdb70_database_path=` is deliberately left empty: the multimer preset uses the `pdb_seqres` and `uniprot` databases instead of PDB70.
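For reference, a multimer FASTA contains one record per chain; the identifiers and sequences below are placeholders only:
```
>chain_A_placeholder
MNIFEMLRIDEGLRLKIYKDTEGYYTIGIGHLLTKSPSLNAAKSELDKAIGRNTNG
>chain_B_placeholder
MKTAYIAKQRQISFVKSHFSRQLEERLGLIEVQAPILSRVGDGTQDNLSGAEKAVQ
```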
## Results
The `--output_dir` directory is laid out as follows:
```
<target_name>/
features.pkl
ranked_{0,1,2,3,4}.pdb
ranking_debug.json
relaxed_model_{1,2,3,4,5}.pdb
result_model_{1,2,3,4,5}.pkl
timings.json
unrelaxed_model_{1,2,3,4,5}.pdb
msas/
bfd_uniclust_hits.a3m
mgnify_hits.sto
uniref90_hits.sto
...
```
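`ranking_debug.json` stores each model's confidence ('iptm+ptm' for multimer runs, 'plddts' otherwise) together with the rank order; a small sketch for reading it (the target path is a placeholder):
```
# Print the models in rank order with their confidences (illustrative).
import json
import os

target_dir = '/path/to/output_dir/<target_name>'  # placeholder
with open(os.path.join(target_dir, 'ranking_debug.json')) as f:
    ranking = json.load(f)

label = 'iptm+ptm' if 'iptm+ptm' in ranking else 'plddts'
for rank, model_name in enumerate(ranking['order']):
    print(rank, model_name, ranking[label][model_name])
```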
View the protein 3D structure at [https://www.pdbus.org/3d-view](https://www.pdbus.org/3d-view):
![img](./docs/result_pdb.png)
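Because `run_alphafold.py` writes the per-residue pLDDT into the b-factor column of each output PDB, confidence can also be recovered directly from a ranked structure; a sketch assuming Biopython is available (it is not part of the image spec above):
```
# Recover per-residue pLDDT from the b-factor column of a ranked PDB (sketch).
from Bio.PDB import PDBParser
import numpy as np

structure = PDBParser(QUIET=True).get_structure('ranked_0', 'ranked_0.pdb')
plddt_per_residue = [
    # All atoms of a residue carry the same pLDDT, so read the first atom.
    next(iter(residue)).get_bfactor()
    for residue in structure.get_residues()
]
print('mean pLDDT:', np.mean(plddt_per_residue))
```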
## Accuracy
Test data: [CASP14](https://www.predictioncenter.org/casp14/targetlist.cgi), [UniProt](https://www.uniprot.org/)
Accelerator: 1x DCU (Gen 1, 16 GB)
1. Compute the pLDDT:
```
python3 pkl2plddt.py
```
where `data_path` in the script is the path to the `.pkl` file produced by inference.
2. Other accuracy metrics can be computed with [https://zhanggroup.org/TM-score/](https://zhanggroup.org/TM-score/).
Accuracy results:
| Precision | Sequence type | Target | Length | GDT-TS | GDT-HA | LDDT | TM-score | MaxSub | RMSD (Å) |
| :------: | :------: | :------: | :------: | :------: | :------: | :------: | :------: | :------: | :------: |
| fp32 | Monomer | T1026 | 172 | 0.849 | 0.658 | 75.050 | 0.901 | 0.851 | 1.6 |
| fp32 | Monomer | T1053 | 580 | 0.941 | 0.789 | 92.316 | 0.985 | 0.935 | 1.1 |
| fp32 | Monomer | T1091 | 863 | 0.492 | 0.332 | 85.083 | 0.740 | 0.388 | 6.7 |
## Application Scenarios
### Algorithm Category
NLP
### Key Industries
Healthcare, scientific research, education
## Source Repository and Issue Reporting
* [https://developer.hpccube.com/codes/modelzoo/alphafold2_jax](https://developer.hpccube.com/codes/modelzoo/alphafold2_jax)
## References
* [https://github.com/deepmind/alphafold](https://github.com/deepmind/alphafold)
import pickle

import numpy as np

# Path to a result pickle produced by inference.
data_path = r'output/monomer/result_model_1.pkl'
with open(data_path, 'rb') as f:
    datas = pickle.load(f)

# Append the mean pLDDT of the prediction to a log file.
with open('output/T1024.txt', mode='a+', encoding='utf-8') as log:
    print(np.mean(datas['plddt']), file=log)
# Copyright 2021 DeepMind Technologies Limited
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
"""Full AlphaFold protein structure prediction script."""
import json
import os
import pathlib
import pickle
import random
import shutil
import sys
import time
from typing import Dict, Union, Optional
from absl import app
from absl import flags
from absl import logging
from alphafold.common import protein
from alphafold.common import residue_constants
from alphafold.data import pipeline
from alphafold.data import pipeline_multimer
from alphafold.data import templates
from alphafold.data.tools import hhsearch
from alphafold.data.tools import hmmsearch
from alphafold.model import config
from alphafold.model import data
from alphafold.model import model
from alphafold.relax import relax
import numpy as np
# Internal import (7716).
#### USER CONFIGURATION ####
# Set to the target of scripts/download_all_data.sh
DOWNLOAD_DIR = '/alphafold_data_set'
# Path to a directory that will store the results.
output_dir = '~/af2_out'
# Names of models to use.
model_names = [
'model_1',
'model_2',
'model_3',
'model_4',
'model_5',
]
# You can individually override the following paths if you have placed the
# data in locations other than the DOWNLOAD_DIR.
# Path to directory of supporting data, contains 'params' dir.
data_dir = DOWNLOAD_DIR
# Path to the Uniref90 database for use by JackHMMER.
uniref90_database_path = os.path.join(
DOWNLOAD_DIR, 'uniref90', 'uniref90.fasta')
# Path to the MGnify database for use by JackHMMER.
mgnify_database_path = os.path.join(
DOWNLOAD_DIR, 'mgnify', 'mgy_clusters_2018_12.fa')
# Path to the BFD database for use by HHblits.
bfd_database_path = os.path.join(
DOWNLOAD_DIR, 'bfd',
'bfd_metaclust_clu_complete_id30_c90_final_seq.sorted_opt')
# Path to the Uniclust30 database for use by HHblits.
uniclust30_database_path = os.path.join(
DOWNLOAD_DIR, 'uniclust30', 'uniclust30_2018_08', 'uniclust30_2018_08')
#DOWNLOAD_DIR, 'uniclust30', 'UniRef30_2020_02')
# Path to the PDB70 database for use by HHsearch.
pdb70_database_path = os.path.join(DOWNLOAD_DIR, 'pdb70', 'pdb70')
# Path to a directory with template mmCIF structures, each named <pdb_id>.cif
template_mmcif_dir = os.path.join(DOWNLOAD_DIR, 'pdb_mmcif', 'mmcif_files')
# Path to a file mapping obsolete PDB IDs to their replacements.
obsolete_pdbs_path = os.path.join(DOWNLOAD_DIR, 'pdb_mmcif', 'obsolete.dat')
#### END OF USER CONFIGURATION ####
flags.DEFINE_list('fasta_paths', None, 'Paths to FASTA files, each containing '
'one sequence. Paths should be separated by commas. '
'All FASTA paths must have a unique basename as the '
'basename is used to name the output directories for '
'each prediction.')
flags.DEFINE_string('output_dir', output_dir, 'Path to a directory that will '
'store the results.')
flags.DEFINE_list('model_names', model_names, 'Names of models to use.')
flags.DEFINE_string('data_dir', data_dir, 'Path to directory of supporting data.')
flags.DEFINE_string('jackhmmer_binary_path', 'jackhmmer',
'Path to the JackHMMER executable.')
flags.DEFINE_string('hhblits_binary_path', 'hhblits',
'Path to the HHblits executable.')
flags.DEFINE_string('hhsearch_binary_path', 'hhsearch',
'Path to the HHsearch executable.')
flags.DEFINE_string('hmmsearch_binary_path', shutil.which('hmmsearch'),
'Path to the hmmsearch executable.')
flags.DEFINE_string('hmmbuild_binary_path', shutil.which('hmmbuild'),
'Path to the hmmbuild executable.')
flags.DEFINE_string('kalign_binary_path', 'kalign',
'Path to the Kalign executable.')
flags.DEFINE_string('uniref90_database_path', uniref90_database_path, 'Path to the Uniref90 '
'database for use by JackHMMER.')
flags.DEFINE_string('mgnify_database_path', mgnify_database_path, 'Path to the MGnify '
'database for use by JackHMMER.')
flags.DEFINE_string('bfd_database_path', bfd_database_path, 'Path to the BFD '
'database for use by HHblits.')
flags.DEFINE_string('uniclust30_database_path', uniclust30_database_path, 'Path to the Uniclust30 '
'database for use by HHblits.')
flags.DEFINE_string('pdb70_database_path', pdb70_database_path, 'Path to the PDB70 '
'database for use by HHsearch.')
flags.DEFINE_string('template_mmcif_dir', template_mmcif_dir, 'Path to a directory with '
'template mmCIF structures, each named <pdb_id>.cif')
flags.DEFINE_string('small_bfd_database_path', None, 'Path to the small '
'version of BFD used with the "reduced_dbs" preset.')
flags.DEFINE_string('uniprot_database_path', None, 'Path to the Uniprot '
'database for use by JackHMMer.')
flags.DEFINE_string('pdb_seqres_database_path', None, 'Path to the PDB '
'seqres database for use by hmmsearch.')
flags.DEFINE_string('max_template_date', '2020-05-14', 'Maximum template release date '
'to consider. Important if folding historical test sets.')
flags.DEFINE_string('obsolete_pdbs_path', obsolete_pdbs_path, 'Path to file containing a '
'mapping from obsolete PDB IDs to the PDB IDs of their '
'replacements.')
flags.DEFINE_enum('db_preset', 'full_dbs',
['full_dbs', 'reduced_dbs'],
'Choose preset MSA database configuration - '
'smaller genetic database config (reduced_dbs) or '
'full genetic database config (full_dbs)')
flags.DEFINE_enum('model_preset', 'monomer',
['monomer', 'monomer_casp14', 'monomer_ptm', 'multimer'],
'Choose preset model configuration - the monomer model, '
'the monomer model with extra ensembling, monomer model with '
'pTM head, or multimer model')
flags.DEFINE_boolean('benchmark', False, 'Run multiple JAX model evaluations '
'to obtain a timing that excludes the compilation time, '
'which should be more indicative of the time required for '
'inferencing many proteins.')
flags.DEFINE_integer('random_seed', None, 'The random seed for the data '
'pipeline. By default, this is randomly generated. Note '
'that even if this is set, Alphafold may still not be '
'deterministic, because processes like GPU inference are '
'nondeterministic.')
flags.DEFINE_integer('num_multimer_predictions_per_model', 5, 'How many '
'predictions (each with a different random seed) will be '
'generated per model. E.g. if this is 2 and there are 5 '
'models then there will be 10 predictions per input. '
'Note: this FLAG only applies if model_preset=multimer')
flags.DEFINE_boolean('use_precomputed_msas', False, 'Whether to read MSAs that '
'have been written to disk instead of running the MSA '
'tools. The MSA files are looked up in the output '
'directory, so it must stay the same between multiple '
'runs that are to reuse the MSAs. WARNING: This will not '
'check if the sequence, database or configuration have '
'changed.')
flags.DEFINE_boolean('run_relax', True, 'Whether to run the final relaxation '
'step on the predicted models. Turning relax off might '
'result in predictions with distracting stereochemical '
'violations but might help in case you are having issues '
'with the relaxation stage.')
flags.DEFINE_boolean('use_gpu_relax', None, 'Whether to relax on GPU. '
'Relax on GPU can be much faster than CPU, so it is '
'recommended to enable if possible. GPUs must be available'
' if this setting is enabled.')
FLAGS = flags.FLAGS
MAX_TEMPLATE_HITS = 20
RELAX_MAX_ITERATIONS = 0
RELAX_ENERGY_TOLERANCE = 2.39
RELAX_STIFFNESS = 10.0
RELAX_EXCLUDE_RESIDUES = []
RELAX_MAX_OUTER_ITERATIONS = 3
def _check_flag(flag_name: str,
other_flag_name: str,
should_be_set: bool):
if should_be_set != bool(FLAGS[flag_name].value):
verb = 'be' if should_be_set else 'not be'
raise ValueError(f'{flag_name} must {verb} set when running with '
f'"--{other_flag_name}={FLAGS[other_flag_name].value}".')
def predict_structure(
fasta_path: str,
fasta_name: str,
output_dir_base: str,
data_pipeline: Union[pipeline.DataPipeline, pipeline_multimer.DataPipeline],
model_runners: Dict[str, model.RunModel],
amber_relaxer: relax.AmberRelaxation,
benchmark: bool,
random_seed: int):
"""Predicts structure using AlphaFold for the given sequence."""
logging.info('Predicting %s', fasta_name)
timings = {}
output_dir = os.path.join(output_dir_base, fasta_name)
if not os.path.exists(output_dir):
os.makedirs(output_dir)
msa_output_dir = os.path.join(output_dir, 'msas')
if not os.path.exists(msa_output_dir):
os.makedirs(msa_output_dir)
features_output_path = os.path.join(output_dir, 'features.pkl')
  # Get features. If a features.pkl file already exists, load it and skip the
  # MSA and template search; otherwise run the full data pipeline.
  t_0 = time.time()
  if os.path.exists(features_output_path):
feature_dict = pickle.load(open(features_output_path, 'rb'))
else:
feature_dict = data_pipeline.process(
input_fasta_path=fasta_path,
msa_output_dir=msa_output_dir)
timings['features'] = time.time() - t_0
# Write out features as a pickled dictionary.
with open(features_output_path, 'wb') as f:
pickle.dump(feature_dict, f, protocol=4)
unrelaxed_pdbs = {}
relaxed_pdbs = {}
ranking_confidences = {}
# Run the models.
num_models = len(model_runners)
for model_index, (model_name, model_runner) in enumerate(
model_runners.items()):
logging.info('Running model %s on %s', model_name, fasta_name)
t_0 = time.time()
model_random_seed = model_index + random_seed * num_models
processed_feature_dict = model_runner.process_features(
feature_dict, random_seed=model_random_seed)
timings[f'process_features_{model_name}'] = time.time() - t_0
t_0 = time.time()
prediction_result = model_runner.predict(processed_feature_dict,
random_seed=model_random_seed)
t_diff = time.time() - t_0
timings[f'predict_and_compile_{model_name}'] = t_diff
logging.info(
'Total JAX model %s on %s predict time (includes compilation time, see --benchmark): %.1fs',
model_name, fasta_name, t_diff)
if benchmark:
t_0 = time.time()
model_runner.predict(processed_feature_dict,
random_seed=model_random_seed)
t_diff = time.time() - t_0
timings[f'predict_benchmark_{model_name}'] = t_diff
logging.info(
'Total JAX model %s on %s predict time (excludes compilation time): %.1fs',
model_name, fasta_name, t_diff)
plddt = prediction_result['plddt']
ranking_confidences[model_name] = prediction_result['ranking_confidence']
# Save the model outputs.
result_output_path = os.path.join(output_dir, f'result_{model_name}.pkl')
with open(result_output_path, 'wb') as f:
pickle.dump(prediction_result, f, protocol=4)
# Add the predicted LDDT in the b-factor column.
# Note that higher predicted LDDT value means higher model confidence.
plddt_b_factors = np.repeat(
plddt[:, None], residue_constants.atom_type_num, axis=-1)
unrelaxed_protein = protein.from_prediction(
features=processed_feature_dict,
result=prediction_result,
b_factors=plddt_b_factors,
remove_leading_feature_dimension=not model_runner.multimer_mode)
unrelaxed_pdbs[model_name] = protein.to_pdb(unrelaxed_protein)
unrelaxed_pdb_path = os.path.join(output_dir, f'unrelaxed_{model_name}.pdb')
with open(unrelaxed_pdb_path, 'w') as f:
f.write(unrelaxed_pdbs[model_name])
if amber_relaxer:
# Relax the prediction.
t_0 = time.time()
relaxed_pdb_str, _, _ = amber_relaxer.process(prot=unrelaxed_protein)
timings[f'relax_{model_name}'] = time.time() - t_0
relaxed_pdbs[model_name] = relaxed_pdb_str
# Save the relaxed PDB.
relaxed_output_path = os.path.join(
output_dir, f'relaxed_{model_name}.pdb')
with open(relaxed_output_path, 'w') as f:
f.write(relaxed_pdb_str)
# Rank by model confidence and write out relaxed PDBs in rank order.
ranked_order = []
for idx, (model_name, _) in enumerate(
sorted(ranking_confidences.items(), key=lambda x: x[1], reverse=True)):
ranked_order.append(model_name)
ranked_output_path = os.path.join(output_dir, f'ranked_{idx}.pdb')
with open(ranked_output_path, 'w') as f:
if amber_relaxer:
f.write(relaxed_pdbs[model_name])
else:
f.write(unrelaxed_pdbs[model_name])
ranking_output_path = os.path.join(output_dir, 'ranking_debug.json')
with open(ranking_output_path, 'w') as f:
label = 'iptm+ptm' if 'iptm' in prediction_result else 'plddts'
f.write(json.dumps(
{label: ranking_confidences, 'order': ranked_order}, indent=4))
logging.info('Final timings for %s: %s', fasta_name, timings)
timings_output_path = os.path.join(output_dir, 'timings.json')
with open(timings_output_path, 'w') as f:
f.write(json.dumps(timings, indent=4))
def main(argv):
if len(argv) > 1:
raise app.UsageError('Too many command-line arguments.')
for tool_name in (
'jackhmmer', 'hhblits', 'hhsearch', 'hmmsearch', 'hmmbuild', 'kalign'):
if not FLAGS[f'{tool_name}_binary_path'].value:
raise ValueError(f'Could not find path to the "{tool_name}" binary. Make '
'sure it is installed on your system.')
use_small_bfd = FLAGS.db_preset == 'reduced_dbs'
_check_flag('small_bfd_database_path', 'db_preset',
should_be_set=use_small_bfd)
_check_flag('bfd_database_path', 'db_preset',
should_be_set=not use_small_bfd)
_check_flag('uniclust30_database_path', 'db_preset',
should_be_set=not use_small_bfd)
run_multimer_system = 'multimer' in FLAGS.model_preset
_check_flag('pdb70_database_path', 'model_preset',
should_be_set=not run_multimer_system)
_check_flag('pdb_seqres_database_path', 'model_preset',
should_be_set=run_multimer_system)
_check_flag('uniprot_database_path', 'model_preset',
should_be_set=run_multimer_system)
if FLAGS.model_preset == 'monomer_casp14':
num_ensemble = 8
else:
num_ensemble = 1
# Check for duplicate FASTA file names.
fasta_names = [pathlib.Path(p).stem for p in FLAGS.fasta_paths]
if len(fasta_names) != len(set(fasta_names)):
raise ValueError('All FASTA paths must have a unique basename.')
if run_multimer_system:
template_searcher = hmmsearch.Hmmsearch(
binary_path=FLAGS.hmmsearch_binary_path,
hmmbuild_binary_path=FLAGS.hmmbuild_binary_path,
database_path=FLAGS.pdb_seqres_database_path)
template_featurizer = templates.HmmsearchHitFeaturizer(
mmcif_dir=FLAGS.template_mmcif_dir,
max_template_date=FLAGS.max_template_date,
max_hits=MAX_TEMPLATE_HITS,
kalign_binary_path=FLAGS.kalign_binary_path,
release_dates_path=None,
obsolete_pdbs_path=FLAGS.obsolete_pdbs_path)
else:
template_searcher = hhsearch.HHSearch(
binary_path=FLAGS.hhsearch_binary_path,
databases=[FLAGS.pdb70_database_path])
template_featurizer = templates.HhsearchHitFeaturizer(
mmcif_dir=FLAGS.template_mmcif_dir,
max_template_date=FLAGS.max_template_date,
max_hits=MAX_TEMPLATE_HITS,
kalign_binary_path=FLAGS.kalign_binary_path,
release_dates_path=None,
obsolete_pdbs_path=FLAGS.obsolete_pdbs_path)
monomer_data_pipeline = pipeline.DataPipeline(
jackhmmer_binary_path=FLAGS.jackhmmer_binary_path,
hhblits_binary_path=FLAGS.hhblits_binary_path,
uniref90_database_path=FLAGS.uniref90_database_path,
mgnify_database_path=FLAGS.mgnify_database_path,
bfd_database_path=FLAGS.bfd_database_path,
uniclust30_database_path=FLAGS.uniclust30_database_path,
small_bfd_database_path=FLAGS.small_bfd_database_path,
template_searcher=template_searcher,
template_featurizer=template_featurizer,
use_small_bfd=use_small_bfd,
use_precomputed_msas=FLAGS.use_precomputed_msas)
if run_multimer_system:
num_predictions_per_model = FLAGS.num_multimer_predictions_per_model
data_pipeline = pipeline_multimer.DataPipeline(
monomer_data_pipeline=monomer_data_pipeline,
jackhmmer_binary_path=FLAGS.jackhmmer_binary_path,
uniprot_database_path=FLAGS.uniprot_database_path,
use_precomputed_msas=FLAGS.use_precomputed_msas)
else:
num_predictions_per_model = 1
data_pipeline = monomer_data_pipeline
model_runners = {}
model_names = config.MODEL_PRESETS[FLAGS.model_preset]
for model_name in model_names:
model_config = config.model_config(model_name)
if run_multimer_system:
model_config.model.num_ensemble_eval = num_ensemble
else:
model_config.data.eval.num_ensemble = num_ensemble
model_params = data.get_model_haiku_params(
model_name=model_name, data_dir=FLAGS.data_dir)
model_runner = model.RunModel(model_config, model_params)
for i in range(num_predictions_per_model):
model_runners[f'{model_name}_pred_{i}'] = model_runner
logging.info('Have %d models: %s', len(model_runners),
list(model_runners.keys()))
if FLAGS.run_relax:
amber_relaxer = relax.AmberRelaxation(
max_iterations=RELAX_MAX_ITERATIONS,
tolerance=RELAX_ENERGY_TOLERANCE,
stiffness=RELAX_STIFFNESS,
exclude_residues=RELAX_EXCLUDE_RESIDUES,
max_outer_iterations=RELAX_MAX_OUTER_ITERATIONS,
use_gpu=FLAGS.use_gpu_relax)
else:
amber_relaxer = None
random_seed = FLAGS.random_seed
if random_seed is None:
random_seed = random.randrange(sys.maxsize // len(model_runners))
logging.info('Using random seed %d for the data pipeline', random_seed)
# Predict structure for each of the sequences.
for i, fasta_path in enumerate(FLAGS.fasta_paths):
fasta_name = fasta_names[i]
predict_structure(
fasta_path=fasta_path,
fasta_name=fasta_name,
output_dir_base=FLAGS.output_dir,
data_pipeline=data_pipeline,
model_runners=model_runners,
amber_relaxer=amber_relaxer,
benchmark=FLAGS.benchmark,
random_seed=random_seed)
if __name__ == '__main__':
flags.mark_flags_as_required([
'fasta_paths',
'output_dir',
'data_dir',
'uniref90_database_path',
'mgnify_database_path',
'template_mmcif_dir',
'max_template_date',
'obsolete_pdbs_path',
'use_gpu_relax',
])
app.run(main)
python3 run_alphafold.py \
--fasta_paths=monomer.fasta \
--output_dir=./ \
--max_template_date=2020-05-14 \
--model_preset=monomer \
--run_relax=true \
--use_gpu_relax=true
python3 run_alphafold.py \
--fasta_paths=multimer.fasta \
--output_dir=./ \
--uniprot_database_path=/data/uniprot/uniprot_trembl.fasta \
--pdb_seqres_database_path=/data/pdb_seqres/pdb_seqres.txt \
--pdb70_database_path= \
--max_template_date=2020-05-14 \
--model_preset=multimer \
--run_relax=true \
--use_gpu_relax=true
#!/bin/bash
#
# Copyright 2021 DeepMind Technologies Limited
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
#
# Downloads and unzips all required data for AlphaFold.
#
# Usage: bash download_all_data.sh /path/to/download/directory
set -e
if [[ $# -eq 0 ]]; then
echo "Error: download directory must be provided as an input argument."
exit 1
fi
if ! command -v aria2c &> /dev/null ; then
echo "Error: aria2c could not be found. Please install aria2c (sudo apt install aria2)."
exit 1
fi
DOWNLOAD_DIR="$1"
DOWNLOAD_MODE="${2:-full_dbs}" # Default mode to full_dbs.
if [[ "${DOWNLOAD_MODE}" != full_dbs && "${DOWNLOAD_MODE}" != reduced_dbs ]]
then
echo "DOWNLOAD_MODE ${DOWNLOAD_MODE} not recognized."
exit 1
fi
SCRIPT_DIR="$(dirname "$(realpath "$0")")"
echo "Downloading AlphaFold parameters..."
bash "${SCRIPT_DIR}/download_alphafold_params.sh" "${DOWNLOAD_DIR}"
if [[ "${DOWNLOAD_MODE}" = reduced_dbs ]] ; then
echo "Downloading Small BFD..."
bash "${SCRIPT_DIR}/download_small_bfd.sh" "${DOWNLOAD_DIR}"
else
echo "Downloading BFD..."
bash "${SCRIPT_DIR}/download_bfd.sh" "${DOWNLOAD_DIR}"
fi
echo "Downloading MGnify..."
bash "${SCRIPT_DIR}/download_mgnify.sh" "${DOWNLOAD_DIR}"
echo "Downloading PDB70..."
bash "${SCRIPT_DIR}/download_pdb70.sh" "${DOWNLOAD_DIR}"
echo "Downloading PDB mmCIF files..."
bash "${SCRIPT_DIR}/download_pdb_mmcif.sh" "${DOWNLOAD_DIR}"
echo "Downloading Uniclust30..."
bash "${SCRIPT_DIR}/download_uniclust30.sh" "${DOWNLOAD_DIR}"
echo "Downloading Uniref90..."
bash "${SCRIPT_DIR}/download_uniref90.sh" "${DOWNLOAD_DIR}"
echo "Downloading UniProt..."
bash "${SCRIPT_DIR}/download_uniprot.sh" "${DOWNLOAD_DIR}"
echo "Downloading PDB SeqRes..."
bash "${SCRIPT_DIR}/download_pdb_seqres.sh" "${DOWNLOAD_DIR}"
echo "All data downloaded."
#!/bin/bash
#
# Copyright 2021 DeepMind Technologies Limited
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
#
# Downloads and unzips the AlphaFold parameters.
#
# Usage: bash download_alphafold_params.sh /path/to/download/directory
set -e
if [[ $# -eq 0 ]]; then
echo "Error: download directory must be provided as an input argument."
exit 1
fi
if ! command -v aria2c &> /dev/null ; then
echo "Error: aria2c could not be found. Please install aria2c (sudo apt install aria2)."
exit 1
fi
DOWNLOAD_DIR="$1"
ROOT_DIR="${DOWNLOAD_DIR}/params"
SOURCE_URL="https://storage.googleapis.com/alphafold/alphafold_params_2022-03-02.tar"
BASENAME=$(basename "${SOURCE_URL}")
mkdir --parents "${ROOT_DIR}"
aria2c "${SOURCE_URL}" --dir="${ROOT_DIR}"
tar --extract --verbose --file="${ROOT_DIR}/${BASENAME}" \
--directory="${ROOT_DIR}" --preserve-permissions
rm "${ROOT_DIR}/${BASENAME}"
#!/bin/bash
#
# Copyright 2021 DeepMind Technologies Limited
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
#
# Downloads and unzips the BFD database for AlphaFold.
#
# Usage: bash download_bfd.sh /path/to/download/directory
set -e
if [[ $# -eq 0 ]]; then
echo "Error: download directory must be provided as an input argument."
exit 1
fi
if ! command -v aria2c &> /dev/null ; then
echo "Error: aria2c could not be found. Please install aria2c (sudo apt install aria2)."
exit 1
fi
DOWNLOAD_DIR="$1"
ROOT_DIR="${DOWNLOAD_DIR}/bfd"
# Mirror of:
# https://bfd.mmseqs.com/bfd_metaclust_clu_complete_id30_c90_final_seq.sorted_opt.tar.gz.
SOURCE_URL="https://storage.googleapis.com/alphafold-databases/casp14_versions/bfd_metaclust_clu_complete_id30_c90_final_seq.sorted_opt.tar.gz"
BASENAME=$(basename "${SOURCE_URL}")
mkdir --parents "${ROOT_DIR}"
aria2c "${SOURCE_URL}" --dir="${ROOT_DIR}"
tar --extract --verbose --file="${ROOT_DIR}/${BASENAME}" \
--directory="${ROOT_DIR}"
rm "${ROOT_DIR}/${BASENAME}"
#!/bin/bash
#
# Copyright 2021 DeepMind Technologies Limited
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
#
# Downloads and unzips the MGnify database for AlphaFold.
#
# Usage: bash download_mgnify.sh /path/to/download/directory
set -e
if [[ $# -eq 0 ]]; then
echo "Error: download directory must be provided as an input argument."
exit 1
fi
if ! command -v aria2c &> /dev/null ; then
echo "Error: aria2c could not be found. Please install aria2c (sudo apt install aria2)."
exit 1
fi
DOWNLOAD_DIR="$1"
ROOT_DIR="${DOWNLOAD_DIR}/mgnify"
# Mirror of:
# ftp://ftp.ebi.ac.uk/pub/databases/metagenomics/peptide_database/2018_12/mgy_clusters.fa.gz
SOURCE_URL="https://storage.googleapis.com/alphafold-databases/casp14_versions/mgy_clusters_2018_12.fa.gz"
BASENAME=$(basename "${SOURCE_URL}")
mkdir --parents "${ROOT_DIR}"
aria2c "${SOURCE_URL}" --dir="${ROOT_DIR}"
pushd "${ROOT_DIR}"
gunzip "${ROOT_DIR}/${BASENAME}"
popd
#!/bin/bash
#
# Copyright 2021 DeepMind Technologies Limited
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
#
# Downloads and unzips the PDB70 database for AlphaFold.
#
# Usage: bash download_pdb70.sh /path/to/download/directory
set -e
if [[ $# -eq 0 ]]; then
echo "Error: download directory must be provided as an input argument."
exit 1
fi
if ! command -v aria2c &> /dev/null ; then
echo "Error: aria2c could not be found. Please install aria2c (sudo apt install aria2)."
exit 1
fi
DOWNLOAD_DIR="$1"
ROOT_DIR="${DOWNLOAD_DIR}/pdb70"
SOURCE_URL="http://wwwuser.gwdg.de/~compbiol/data/hhsuite/databases/hhsuite_dbs/old-releases/pdb70_from_mmcif_200401.tar.gz"
BASENAME=$(basename "${SOURCE_URL}")
mkdir --parents "${ROOT_DIR}"
aria2c "${SOURCE_URL}" --dir="${ROOT_DIR}"
tar --extract --verbose --file="${ROOT_DIR}/${BASENAME}" \
--directory="${ROOT_DIR}"
rm "${ROOT_DIR}/${BASENAME}"
#!/bin/bash
#
# Copyright 2021 DeepMind Technologies Limited
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
#
# Downloads, unzips and flattens the PDB database for AlphaFold.
#
# Usage: bash download_pdb_mmcif.sh /path/to/download/directory
set -e
if [[ $# -eq 0 ]]; then
echo "Error: download directory must be provided as an input argument."
exit 1
fi
if ! command -v aria2c &> /dev/null ; then
echo "Error: aria2c could not be found. Please install aria2c (sudo apt install aria2)."
exit 1
fi
if ! command -v rsync &> /dev/null ; then
echo "Error: rsync could not be found. Please install rsync."
exit 1
fi
DOWNLOAD_DIR="$1"
ROOT_DIR="${DOWNLOAD_DIR}/pdb_mmcif"
RAW_DIR="${ROOT_DIR}/raw"
MMCIF_DIR="${ROOT_DIR}/mmcif_files"
echo "Running rsync to fetch all mmCIF files (note that the rsync progress estimate might be inaccurate)..."
echo "If the download speed is too slow, try changing the mirror to:"
echo " * rsync.ebi.ac.uk::pub/databases/pdb/data/structures/divided/mmCIF/ (Europe)"
echo " * ftp.pdbj.org::ftp_data/structures/divided/mmCIF/ (Asia)"
echo "or see https://www.wwpdb.org/ftp/pdb-ftp-sites for more download options."
mkdir --parents "${RAW_DIR}"
rsync --recursive --links --perms --times --compress --info=progress2 --delete --port=33444 \
rsync.rcsb.org::ftp_data/structures/divided/mmCIF/ \
"${RAW_DIR}"
# rsync --recursive --links --perms --times --compress --info=progress2 --delete --port=33444 \
# data.pdbj.org::ftp_data/structures/divided/mmCIF/ \
# "${RAW_DIR}"
echo "Unzipping all mmCIF files..."
find "${RAW_DIR}/" -type f -iname "*.gz" -exec gunzip {} +
echo "Flattening all mmCIF files..."
mkdir --parents "${MMCIF_DIR}"
find "${RAW_DIR}" -type d -empty -delete # Delete empty directories.
for subdir in "${RAW_DIR}"/*; do
mv "${subdir}/"*.cif "${MMCIF_DIR}"
done
# Delete empty download directory structure.
find "${RAW_DIR}" -type d -empty -delete
aria2c "ftp://ftp.wwpdb.org/pub/pdb/data/status/obsolete.dat" --dir="${ROOT_DIR}"
#!/bin/bash
#
# Copyright 2021 DeepMind Technologies Limited
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
#
# Downloads and unzips the PDB SeqRes database for AlphaFold.
#
# Usage: bash download_pdb_seqres.sh /path/to/download/directory
set -e
if [[ $# -eq 0 ]]; then
echo "Error: download directory must be provided as an input argument."
exit 1
fi
if ! command -v aria2c &> /dev/null ; then
echo "Error: aria2c could not be found. Please install aria2c (sudo apt install aria2)."
exit 1
fi
DOWNLOAD_DIR="$1"
ROOT_DIR="${DOWNLOAD_DIR}/pdb_seqres"
SOURCE_URL="ftp://ftp.wwpdb.org/pub/pdb/derived_data/pdb_seqres.txt"
BASENAME=$(basename "${SOURCE_URL}")
mkdir --parents "${ROOT_DIR}"
aria2c "${SOURCE_URL}" --dir="${ROOT_DIR}"
#!/bin/bash
#
# Copyright 2021 DeepMind Technologies Limited
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
#
# Downloads and unzips the Small BFD database for AlphaFold.
#
# Usage: bash download_small_bfd.sh /path/to/download/directory
set -e
if [[ $# -eq 0 ]]; then
echo "Error: download directory must be provided as an input argument."
exit 1
fi
if ! command -v aria2c &> /dev/null ; then
echo "Error: aria2c could not be found. Please install aria2c (sudo apt install aria2)."
exit 1
fi
DOWNLOAD_DIR="$1"
ROOT_DIR="${DOWNLOAD_DIR}/small_bfd"
SOURCE_URL="https://storage.googleapis.com/alphafold-databases/reduced_dbs/bfd-first_non_consensus_sequences.fasta.gz"
BASENAME=$(basename "${SOURCE_URL}")
mkdir --parents "${ROOT_DIR}"
aria2c "${SOURCE_URL}" --dir="${ROOT_DIR}"
pushd "${ROOT_DIR}"
gunzip "${ROOT_DIR}/${BASENAME}"
popd
#!/bin/bash
#
# Copyright 2021 DeepMind Technologies Limited
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
#
# Downloads and unzips the Uniclust30 database for AlphaFold.
#
# Usage: bash download_uniclust30.sh /path/to/download/directory
set -e
if [[ $# -eq 0 ]]; then
echo "Error: download directory must be provided as an input argument."
exit 1
fi
if ! command -v aria2c &> /dev/null ; then
echo "Error: aria2c could not be found. Please install aria2c (sudo apt install aria2)."
exit 1
fi
DOWNLOAD_DIR="$1"
ROOT_DIR="${DOWNLOAD_DIR}/uniclust30"
# Mirror of:
# http://wwwuser.gwdg.de/~compbiol/uniclust/2018_08/uniclust30_2018_08_hhsuite.tar.gz
SOURCE_URL="https://storage.googleapis.com/alphafold-databases/casp14_versions/uniclust30_2018_08_hhsuite.tar.gz"
BASENAME=$(basename "${SOURCE_URL}")
mkdir --parents "${ROOT_DIR}"
aria2c "${SOURCE_URL}" --dir="${ROOT_DIR}"
tar --extract --verbose --file="${ROOT_DIR}/${BASENAME}" \
--directory="${ROOT_DIR}"
rm "${ROOT_DIR}/${BASENAME}"
#!/bin/bash
#
# Copyright 2021 DeepMind Technologies Limited
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
#
# Downloads, unzips and merges the SwissProt and TrEMBL databases for
# AlphaFold-Multimer.
#
# Usage: bash download_uniprot.sh /path/to/download/directory
set -e
if [[ $# -eq 0 ]]; then
echo "Error: download directory must be provided as an input argument."
exit 1
fi
if ! command -v aria2c &> /dev/null ; then
echo "Error: aria2c could not be found. Please install aria2c (sudo apt install aria2)."
exit 1
fi
DOWNLOAD_DIR="$1"
ROOT_DIR="${DOWNLOAD_DIR}/uniprot"
TREMBL_SOURCE_URL="ftp://ftp.ebi.ac.uk/pub/databases/uniprot/current_release/knowledgebase/complete/uniprot_trembl.fasta.gz"
TREMBL_BASENAME=$(basename "${TREMBL_SOURCE_URL}")
TREMBL_UNZIPPED_BASENAME="${TREMBL_BASENAME%.gz}"
SPROT_SOURCE_URL="ftp://ftp.ebi.ac.uk/pub/databases/uniprot/current_release/knowledgebase/complete/uniprot_sprot.fasta.gz"
SPROT_BASENAME=$(basename "${SPROT_SOURCE_URL}")
SPROT_UNZIPPED_BASENAME="${SPROT_BASENAME%.gz}"
mkdir --parents "${ROOT_DIR}"
aria2c "${TREMBL_SOURCE_URL}" --dir="${ROOT_DIR}"
aria2c "${SPROT_SOURCE_URL}" --dir="${ROOT_DIR}"
pushd "${ROOT_DIR}"
gunzip "${ROOT_DIR}/${TREMBL_BASENAME}"
gunzip "${ROOT_DIR}/${SPROT_BASENAME}"
# Concatenate TrEMBL and SwissProt, rename to uniprot and clean up.
cat "${ROOT_DIR}/${SPROT_UNZIPPED_BASENAME}" >> "${ROOT_DIR}/${TREMBL_UNZIPPED_BASENAME}"
mv "${ROOT_DIR}/${TREMBL_UNZIPPED_BASENAME}" "${ROOT_DIR}/uniprot.fasta"
rm "${ROOT_DIR}/${SPROT_UNZIPPED_BASENAME}"
popd
#!/bin/bash
#
# Copyright 2021 DeepMind Technologies Limited
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
#
# Downloads and unzips the UniRef90 database for AlphaFold.
#
# Usage: bash download_uniref90.sh /path/to/download/directory
set -e
if [[ $# -eq 0 ]]; then
echo "Error: download directory must be provided as an input argument."
exit 1
fi
if ! command -v aria2c &> /dev/null ; then
echo "Error: aria2c could not be found. Please install aria2c (sudo apt install aria2)."
exit 1
fi
DOWNLOAD_DIR="$1"
ROOT_DIR="${DOWNLOAD_DIR}/uniref90"
SOURCE_URL="ftp://ftp.uniprot.org/pub/databases/uniprot/uniref/uniref90/uniref90.fasta.gz"
BASENAME=$(basename "${SOURCE_URL}")
mkdir --parents "${ROOT_DIR}"
aria2c "${SOURCE_URL}" --dir="${ROOT_DIR}"
pushd "${ROOT_DIR}"
gunzip "${ROOT_DIR}/${BASENAME}"
popd