Commit 0bab1bf8 authored by Saran Tunyasuvunakool

Add a Colab notebook, add reduced BFD, and various other fixes and improvements.

PiperOrigin-RevId: 386228948
parent d26287ea
......@@ -9,7 +9,15 @@ of this document.
Any publication that discloses findings arising from using this source code or
the model parameters should [cite](#citing-this-work) the
[AlphaFold paper](https://doi.org/10.1038/s41586-021-03819-2).
[AlphaFold paper](https://doi.org/10.1038/s41586-021-03819-2). Please also refer
to the
[Supplementary Information](https://static-content.springer.com/esm/art%3A10.1038%2Fs41586-021-03819-2/MediaObjects/41586_2021_3819_MOESM1_ESM.pdf)
for a detailed description of the method.
**You can use a slightly simplified version of AlphaFold with
[this Colab
notebook](https://colab.research.google.com/github/deepmind/alphafold/blob/main/notebooks/AlphaFold.ipynb)**
or community-supported versions (see below).
![CASP14 predictions](imgs/casp14_predictions.gif)
......@@ -39,7 +47,7 @@ The following steps are required in order to run AlphaFold:
### Genetic databases
This step requires `rsync` and `aria2c` to be installed on your machine.
This step requires `aria2c` to be installed on your machine.
AlphaFold needs multiple genetic (sequence) databases to run:
......@@ -51,21 +59,43 @@ AlphaFold needs multiple genetic (sequence) databases to run:
* [PDB](https://www.rcsb.org/) (structures in the mmCIF format).
We provide a script `scripts/download_all_data.sh` that can be used to download
and set up all of these databases. This should take 8–12 hours.
and set up all of these databases:
* Default:
:ledger: **Note: The total download size is around 428 GB and the total size
when unzipped is 2.2 TB. Please make sure you have a large enough hard drive
space, bandwidth and time to download.**
```bash
scripts/download_all_data.sh <DOWNLOAD_DIR>
```
will download the full databases.
* With `reduced_dbs`:
```bash
scripts/download_all_data.sh <DOWNLOAD_DIR> reduced_dbs
```
will download a reduced version of the databases to be used with the
`reduced_dbs` preset.
We don't provide exactly the versions used in CASP14 -- see the
[note on reproducibility](#note-on-reproducibility). Some of the databases are
mirrored for speed; see [mirrored databases](#mirrored-databases).
:ledger: **Note: The total download size for the full databases is around 415 GB
and the total size when unzipped is 2.2 TB. Please make sure you have enough
hard drive space, bandwidth, and time to download. We recommend using an SSD
for better genetic search performance.**
This script will also download the model parameter files. Once the script has
finished, you should have the following directory structure:
```
$DOWNLOAD_DIR/ # Total: ~ 2.2 TB (download: 428 GB)
bfd/ # ~ 1.8 TB (download: 271.6 GB)
$DOWNLOAD_DIR/ # Total: ~ 2.2 TB (download: 438 GB)
bfd/ # ~ 1.7 TB (download: 271.6 GB)
# 6 files.
mgnify/ # ~ 64 GB (download: 32.9 GB)
mgy_clusters.fa
mgy_clusters_2018_08.fa
params/ # ~ 3.5 GB (download: 3.5 GB)
# 5 CASP14 models,
# 5 pTM models,
......@@ -77,13 +107,18 @@ $DOWNLOAD_DIR/ # Total: ~ 2.2 TB (download: 428 GB)
mmcif_files/
# About 180,000 .cif files.
obsolete.dat
uniclust30/ # ~ 87 GB (download: 24.9 GB)
small_bfd/ # ~ 17 GB (download: 9.6 GB)
bfd-first_non_consensus_sequences.fasta
uniclust30/ # ~ 86 GB (download: 24.9 GB)
uniclust30_2018_08/
# 13 files.
uniref90/ # ~ 59 GB (download: 29.7 GB)
uniref90/ # ~ 58 GB (download: 29.7 GB)
uniref90.fasta
```
`bfd/` is only downloaded if you download the full databases, and `small_bfd/`
is only downloaded if you download the reduced databases.
### Model parameters
While the AlphaFold code is licensed under the Apache 2.0 License, the AlphaFold
......@@ -149,16 +184,20 @@ with 12 vCPUs, 85 GB of RAM, a 100 GB boot disk, the databases on an additional
[GPU enumeration](https://docs.nvidia.com/datacenter/cloud-native/container-toolkit/user-guide.html#gpu-enumeration)
for more details.
1. You can control AlphaFold speed / quality tradeoff by adding either
`--preset=full_dbs` or `--preset=casp14` to the run command. We provide the
following presets:
1. You can control AlphaFold speed / quality tradeoff by adding
`--preset=reduced_dbs`, `--preset=full_dbs` or `--preset=casp14` to the run
command. We provide the following presets:
* **casp14**: This preset uses the same settings as were used in CASP14.
It runs with all genetic databases and with 8 ensemblings.
* **reduced_dbs**: This preset is optimized for speed and lower hardware
requirements. It runs with a reduced version of the BFD database and
with no ensembling. It requires 8 CPU cores (vCPUs), 8 GB of RAM, and
600 GB of disk space.
* **full_dbs**: The model in this preset is 8 times faster than the
`casp14` preset with a very minor quality drop (-0.1 average GDT drop on
CASP14 domains). It runs with all genetic databases and with no
ensembling.
* **casp14**: This preset uses the same settings as were used in CASP14.
It runs with all genetic databases and with 8 ensemblings.
Running the command above with the `casp14` preset would look like this:
......@@ -174,7 +213,7 @@ structures, raw model outputs, prediction metadata, and section timings. The
`output_dir` directory will have the following structure:
```
output_dir/
<target_name>/
features.pkl
ranked_{0,1,2,3,4}.pdb
ranking_debug.json
......@@ -190,20 +229,20 @@ output_dir/
The contents of each output file are as follows:
* `features.pkl` – A `pickle` file containing the input feature Numpy arrays
* `features.pkl` – A `pickle` file containing the input feature NumPy arrays
used by the models to produce the structures.
* `unrelaxed_model_*.pdb` – A PDB format text file containing the predicted
structure, exactly as outputted by the model.
* `relaxed_model_*.pdb` – A PDB format text file containing the predicted
structure, after performing an Amber relaxation procedure on the unrelaxed
structure prediction, see Jumper et al. 2021, Suppl. Methods 1.8.6 for
details.
structure prediction (see Jumper et al. 2021, Suppl. Methods 1.8.6 for
details).
* `ranked_*.pdb` – A PDB format text file containing the relaxed predicted
structures, after reordering by model confidence. Here `ranked_0.pdb` should
contain the prediction with the highest confidence, and `ranked_4.pdb` the
prediction with the lowest confidence. To rank model confidence, we use
predicted LDDT (pLDDT), see Jumper et al. 2021, Suppl. Methods 1.9.6 for
details.
predicted LDDT (pLDDT) scores (see Jumper et al. 2021, Suppl. Methods 1.9.6
for details).
* `ranking_debug.json` – A JSON format text file containing the pLDDT values
used to perform the model ranking, and a mapping back to the original model
names.
......@@ -212,10 +251,27 @@ The contents of each output file are as follows:
* `msas/` - A directory containing the files describing the various genetic
tool hits that were used to construct the input MSA.
* `result_model_*.pkl` – A `pickle` file containing a nested dictionary of the
various Numpy arrays directly produced by the model. In addition to the
output of the structure module, this includes auxiliary outputs such as
distograms and pLDDT scores. If using the pTM models then the pTM logits
will also be contained in this file.
various NumPy arrays directly produced by the model. In addition to the
output of the structure module, this includes auxiliary outputs such as:
* Distograms (`distogram/logits` contains a NumPy array of shape [N_res,
N_res, N_bins] and `distogram/bin_edges` contains the definition of the
bins).
* Per-residue pLDDT scores (`plddt` contains a NumPy array of shape
[N_res] with the range of possible values from `0` to `100`, where `100`
means most confident). This can be used to identify sequence regions
predicted with high confidence, or, averaged across residues, as an overall
per-target confidence score.
* Present only if using pTM models: predicted TM-score (`ptm` field
contains a scalar). As a predictor of a global superposition metric,
this score is designed to also assess whether the model is confident in
the overall domain packing.
* Present only if using pTM models: predicted pairwise aligned errors
(`predicted_aligned_error` contains a NumPy array of shape [N_res,
N_res] with the range of possible values from `0` to
`max_predicted_aligned_error`, where `0` means most confident). This can be
used to visualise domain packing confidence within the structure (see the
loading sketch below).
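
As an illustration, the sketch below loads one of these pickle files and reads
back the arrays listed above. The target name and model number in the path are
hypothetical, and the last two reads assume a pTM model was used.

```python
import pickle

import numpy as np

# Hypothetical output path; substitute your own target name and model number.
with open('output_dir/my_target/result_model_1.pkl', 'rb') as f:
    result = pickle.load(f)

plddt = result['plddt']  # [N_res], values 0-100, higher is more confident
print('Mean pLDDT:', np.mean(plddt))

distogram_logits = result['distogram']['logits']   # [N_res, N_res, N_bins]
distogram_bins = result['distogram']['bin_edges']  # definition of the bins

# Present only when a pTM model was used:
if 'ptm' in result:
    print('Predicted TM-score:', float(result['ptm']))
if 'predicted_aligned_error' in result:
    pae = result['predicted_aligned_error']  # [N_res, N_res], 0 is most confident
```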
This code has been tested to match mean top-1 accuracy on a CASP14 test set with
pLDDT ranking over 5 model predictions (some CASP targets were run with earlier
......@@ -284,6 +340,17 @@ If you use the code or data in this package, please cite:
}
```
## Community contributions
Colab notebooks provided by the community (please note that these notebooks may
vary from our full AlphaFold system and we did not validate their accuracy):
* The [ColabFold AlphaFold2 notebook](https://colab.research.google.com/github/sokrypton/ColabFold/blob/main/AlphaFold2.ipynb)
by Martin Steinegger, Sergey Ovchinnikov and Milot Mirdita, which uses an
API hosted at the Södinglab based on the MMseqs2 server [(Mirdita et al.
2019, Bioinformatics)](https://academic.oup.com/bioinformatics/article/35/16/2856/5280135)
for the multiple sequence alignment creation.
## Acknowledgements
AlphaFold communicates with and/or references the following separate libraries
......@@ -292,6 +359,7 @@ and packages:
* [Abseil](https://github.com/abseil/abseil-py)
* [Biopython](https://biopython.org)
* [Chex](https://github.com/deepmind/chex)
* [Colab](https://research.google.com/colaboratory/)
* [Docker](https://www.docker.com)
* [HH Suite](https://github.com/soedinglab/hh-suite)
* [HMMER Suite](http://eddylab.org/software/hmmer)
......@@ -299,18 +367,20 @@ and packages:
* [Immutabledict](https://github.com/corenting/immutabledict)
* [JAX](https://github.com/google/jax/)
* [Kalign](https://msa.sbc.su.se/cgi-bin/msa.cgi)
* [matplotlib](https://matplotlib.org/)
* [ML Collections](https://github.com/google/ml_collections)
* [NumPy](https://numpy.org)
* [OpenMM](https://github.com/openmm/openmm)
* [OpenStructure](https://openstructure.org)
* [pymol3d](https://github.com/avirshup/py3dmol)
* [SciPy](https://scipy.org)
* [Sonnet](https://github.com/deepmind/sonnet)
* [TensorFlow](https://github.com/tensorflow/tensorflow)
* [Tree](https://github.com/deepmind/tree)
* [tqdm](https://github.com/tqdm/tqdm)
We thank all their contributors and maintainers!
## License and Disclaimer
This is not an officially supported Google product.
......@@ -349,3 +419,10 @@ before use.
The following databases have been mirrored by DeepMind, and are available with reference to the following:
* [BFD](https://bfd.mmseqs.com/) (unmodified), by Steinegger M. and Söding J., available under a [Creative Commons Attribution-ShareAlike 4.0 International License](http://creativecommons.org/licenses/by-sa/4.0/).
* [BFD](https://bfd.mmseqs.com/) (modified), by Steinegger M. and Söding J., modified by DeepMind, available under a [Creative Commons Attribution-ShareAlike 4.0 International License](http://creativecommons.org/licenses/by-sa/4.0/). See the Methods section of the [AlphaFold proteome paper](https://www.nature.com/articles/s41586-021-03828-1) for details.
* [Uniclust30: v2018_08](http://wwwuser.gwdg.de/~compbiol/uniclust/2018_08/) (unmodified), by Mirdita M. et al., available under a [Creative Commons Attribution-ShareAlike 4.0 International License](http://creativecommons.org/licenses/by-sa/4.0/).
* [MGnify: v2018_12](http://ftp.ebi.ac.uk/pub/databases/metagenomics/peptide_database/current_release/README.txt) (unmodified), by Mitchell AL et al., available free of all copyright restrictions and made fully and freely available for both non-commercial and commercial use under [CC0 1.0 Universal (CC0 1.0) Public Domain Dedication](https://creativecommons.org/publicdomain/zero/1.0/).
......@@ -67,7 +67,7 @@ def from_pdb_string(pdb_str: str, chain_id: Optional[str] = None) -> Protein:
A new `Protein` parsed from the pdb contents.
"""
pdb_fh = io.StringIO(pdb_str)
parser = PDBParser()
parser = PDBParser(QUIET=True)
structure = parser.get_structure('none', pdb_fh)
models = list(structure.get_models())
if len(models) != 1:
......@@ -207,22 +207,25 @@ def ideal_atom_mask(prot: Protein) -> np.ndarray:
return residue_constants.STANDARD_ATOM_MASK[prot.aatype]
def from_prediction(features: FeatureDict, result: ModelOutput) -> Protein:
def from_prediction(features: FeatureDict, result: ModelOutput,
b_factors: Optional[np.ndarray] = None) -> Protein:
"""Assembles a protein from a prediction.
Args:
features: Dictionary holding model inputs.
result: Dictionary holding model outputs.
b_factors: (Optional) B-factors to use for the protein.
Returns:
A protein instance.
"""
fold_output = result['structure_module']
dist_per_residue = np.zeros_like(fold_output['final_atom_mask'])
if b_factors is None:
b_factors = np.zeros_like(fold_output['final_atom_mask'])
return Protein(
aatype=features['aatype'][0],
atom_positions=fold_output['final_atom_positions'],
atom_mask=fold_output['final_atom_mask'],
residue_index=features['residue_index'][0] + 1,
b_factors=dist_per_residue)
b_factors=b_factors)
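# Editorial sketch (not part of this file): one way a caller might use the new
# `b_factors` argument is to broadcast per-residue pLDDT to a per-atom array of
# the same shape as `final_atom_mask`, so that PDB viewers colouring by
# B-factor show model confidence. The helper name below is hypothetical.
import numpy as np
from alphafold.common import protein, residue_constants

def prediction_to_pdb(features, result):
  plddt = result['plddt']  # [num_res] per-residue confidence
  b_factors = np.repeat(
      plddt[:, None], residue_constants.atom_type_num, axis=1)
  prot = protein.from_prediction(features, result, b_factors=b_factors)
  return protein.to_pdb(prot)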
......@@ -16,7 +16,7 @@
import collections
import re
import string
from typing import Iterable, List, Optional, Sequence, Tuple
from typing import Iterable, List, Optional, Sequence, Tuple, Dict
import dataclasses
......@@ -24,23 +24,14 @@ DeletionMatrix = Sequence[Sequence[int]]
@dataclasses.dataclass(frozen=True)
class HhrHit:
"""Class representing a hit in an hhr file."""
class TemplateHit:
"""Class representing a template hit."""
index: int
name: str
prob_true: float
e_value: float
score: float
aligned_cols: int
identity: float
similarity: float
sum_probs: float
neff: float
query: str
hit_sequence: str
hit_dssp: str
column_score_code: str
confidence_scores: str
indices_query: List[int]
indices_hit: List[int]
......@@ -75,7 +66,8 @@ def parse_fasta(fasta_string: str) -> Tuple[Sequence[str], Sequence[str]]:
def parse_stockholm(
stockholm_string: str) -> Tuple[Sequence[str], DeletionMatrix]:
stockholm_string: str
) -> Tuple[Sequence[str], DeletionMatrix, Sequence[str]]:
"""Parses sequences and deletion matrix from stockholm format alignment.
Args:
......@@ -89,6 +81,8 @@ def parse_stockholm(
* The deletion matrix for the alignment as a list of lists. The element
at `deletion_matrix[i][j]` is the number of residues deleted from
the aligned sequence i at residue position j.
* The names of the targets matched, including the jackhmmer subsequence
suffix.
"""
name_to_sequence = collections.OrderedDict()
for line in stockholm_string.splitlines():
......@@ -128,7 +122,7 @@ def parse_stockholm(
deletion_count = 0
deletion_matrix.append(deletion_vec)
return msa, deletion_matrix
return msa, deletion_matrix, list(name_to_sequence.keys())
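# Editorial sketch (not part of this file): the third element of the return
# value lists the matched target names, including any jackhmmer subsequence
# suffix. The toy alignment below is illustrative only.
toy_sto = (
    '# STOCKHOLM 1.0\n'
    'query     MKV-LI\n'
    'hitA/1-6  MKVQLI\n'
    '//\n')
msa, deletion_matrix, names = parse_stockholm(toy_sto)
# names would be ['query', 'hitA/1-6']; msa[0] is the (ungapped) query row.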
def parse_a3m(a3m_string: str) -> Tuple[Sequence[str], DeletionMatrix]:
......@@ -242,7 +236,7 @@ def _update_hhr_residue_indices_list(
counter += 1
def _parse_hhr_hit(detailed_lines: Sequence[str]) -> HhrHit:
def _parse_hhr_hit(detailed_lines: Sequence[str]) -> TemplateHit:
"""Parses the detailed HMM HMM comparison section for a single Hit.
This works on .hhr files generated from both HHBlits and HHSearch.
......@@ -271,7 +265,7 @@ def _parse_hhr_hit(detailed_lines: Sequence[str]) -> HhrHit:
raise RuntimeError(
'Could not parse section: %s. Expected this: \n%s to contain summary.' %
(detailed_lines, detailed_lines[2]))
(prob_true, e_value, score, aligned_cols, identity, similarity, sum_probs,
(prob_true, e_value, _, aligned_cols, _, _, sum_probs,
neff) = [float(x) for x in match.groups()]
# The next section reads the detailed comparisons. These are in a 'human
......@@ -280,9 +274,6 @@ def _parse_hhr_hit(detailed_lines: Sequence[str]) -> HhrHit:
# that with a regexp in order to deduce the fixed length used for that block.
query = ''
hit_sequence = ''
hit_dssp = ''
column_score_code = ''
confidence_scores = ''
indices_query = []
indices_hit = []
length_block = None
......@@ -312,17 +303,10 @@ def _parse_hhr_hit(detailed_lines: Sequence[str]) -> HhrHit:
_update_hhr_residue_indices_list(delta_query, start, indices_query)
elif line.startswith('T '):
# Parse the hit dssp line.
if line.startswith('T ss_dssp'):
# T ss_dssp hit_dssp
patt = r'T ss_dssp[\t ]*([A-Z-]*)'
groups = _get_hhr_line_regex_groups(patt, line)
assert len(groups[0]) == length_block
hit_dssp += groups[0]
# Parse the hit sequence.
elif (not line.startswith('T ss_pred') and
not line.startswith('T Consensus')):
if (not line.startswith('T ss_dssp') and
not line.startswith('T ss_pred') and
not line.startswith('T Consensus')):
# Thus the first 17 characters must be 'T <hit_name> ', and we can
# parse everything after that.
# start sequence end total_sequence_length
......@@ -336,38 +320,19 @@ def _parse_hhr_hit(detailed_lines: Sequence[str]) -> HhrHit:
hit_sequence += delta_hit_sequence
_update_hhr_residue_indices_list(delta_hit_sequence, start, indices_hit)
# Parse the column score line.
elif line.startswith(' ' * 22):
assert length_block
column_score_code += line[22:length_block + 22]
# Update confidence score.
elif line.startswith('Confidence'):
assert length_block
confidence_scores += line[22:length_block + 22]
return HhrHit(
return TemplateHit(
index=number_of_hit,
name=name_hit,
prob_true=prob_true,
e_value=e_value,
score=score,
aligned_cols=int(aligned_cols),
identity=identity,
similarity=similarity,
sum_probs=sum_probs,
neff=neff,
query=query,
hit_sequence=hit_sequence,
hit_dssp=hit_dssp,
column_score_code=column_score_code,
confidence_scores=confidence_scores,
indices_query=indices_query,
indices_hit=indices_hit,
)
def parse_hhr(hhr_string: str) -> Sequence[HhrHit]:
def parse_hhr(hhr_string: str) -> Sequence[TemplateHit]:
"""Parses the content of an entire HHR file."""
lines = hhr_string.splitlines()
......@@ -383,3 +348,18 @@ def parse_hhr(hhr_string: str) -> Sequence[HhrHit]:
for i in range(len(block_starts) - 1):
hits.append(_parse_hhr_hit(lines[block_starts[i]:block_starts[i + 1]]))
return hits
def parse_e_values_from_tblout(tblout: str) -> Dict[str, float]:
"""Parse target to e-value mapping parsed from Jackhmmer tblout string."""
e_values = {'query': 0}
lines = [line for line in tblout.splitlines() if line and line[0] != '#']
# As per http://eddylab.org/software/hmmer/Userguide.pdf fields are
# space-delimited. Relevant fields are (1) target name: and
# (5) E-value (full sequence) (numbering from 1).
for line in lines:
fields = line.split()
e_value = fields[4]
target_name = fields[0]
e_values[target_name] = float(e_value)
return e_values
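# Editorial sketch (synthetic input, not real hmmer output): only the fields
# the parser reads need to be meaningful here -- the target name in column 1
# and the full-sequence E-value in column 5.
toy_tblout = (
    '# --- header comment lines are skipped ---\n'
    'hitA - query - 1.2e-30 100.0 0.1\n'
    'hitB - query - 3.4e-05  20.0 0.0\n')
e_values = parse_e_values_from_tblout(toy_tblout)
# e_values == {'query': 0, 'hitA': 1.2e-30, 'hitB': 3.4e-05}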
......@@ -15,7 +15,7 @@
"""Functions for building the input features for the AlphaFold model."""
import os
from typing import Mapping, Sequence
from typing import Mapping, Optional, Sequence
import numpy as np
......@@ -88,19 +88,27 @@ class DataPipeline:
hhsearch_binary_path: str,
uniref90_database_path: str,
mgnify_database_path: str,
bfd_database_path: str,
uniclust30_database_path: str,
bfd_database_path: Optional[str],
uniclust30_database_path: Optional[str],
small_bfd_database_path: Optional[str],
pdb70_database_path: str,
template_featurizer: templates.TemplateHitFeaturizer,
use_small_bfd: bool,
mgnify_max_hits: int = 501,
uniref_max_hits: int = 10000):
"""Constructs a feature dict for a given FASTA file."""
self._use_small_bfd = use_small_bfd
self.jackhmmer_uniref90_runner = jackhmmer.Jackhmmer(
binary_path=jackhmmer_binary_path,
database_path=uniref90_database_path)
self.hhblits_bfd_uniclust_runner = hhblits.HHBlits(
binary_path=hhblits_binary_path,
databases=[bfd_database_path, uniclust30_database_path])
if use_small_bfd:
self.jackhmmer_small_bfd_runner = jackhmmer.Jackhmmer(
binary_path=jackhmmer_binary_path,
database_path=small_bfd_database_path)
else:
self.hhblits_bfd_uniclust_runner = hhblits.HHBlits(
binary_path=hhblits_binary_path,
databases=[bfd_database_path, uniclust30_database_path])
self.jackhmmer_mgnify_runner = jackhmmer.Jackhmmer(
binary_path=jackhmmer_binary_path,
database_path=mgnify_database_path)
......@@ -124,9 +132,9 @@ class DataPipeline:
num_res = len(input_sequence)
jackhmmer_uniref90_result = self.jackhmmer_uniref90_runner.query(
input_fasta_path)
input_fasta_path)[0]
jackhmmer_mgnify_result = self.jackhmmer_mgnify_runner.query(
input_fasta_path)
input_fasta_path)[0]
uniref90_msa_as_a3m = parsers.convert_stockholm_to_a3m(
jackhmmer_uniref90_result['sto'], max_sequences=self.uniref_max_hits)
......@@ -140,29 +148,40 @@ class DataPipeline:
with open(mgnify_out_path, 'w') as f:
f.write(jackhmmer_mgnify_result['sto'])
uniref90_msa, uniref90_deletion_matrix = parsers.parse_stockholm(
uniref90_msa, uniref90_deletion_matrix, _ = parsers.parse_stockholm(
jackhmmer_uniref90_result['sto'])
mgnify_msa, mgnify_deletion_matrix = parsers.parse_stockholm(
mgnify_msa, mgnify_deletion_matrix, _ = parsers.parse_stockholm(
jackhmmer_mgnify_result['sto'])
hhsearch_hits = parsers.parse_hhr(hhsearch_result)
mgnify_msa = mgnify_msa[:self.mgnify_max_hits]
mgnify_deletion_matrix = mgnify_deletion_matrix[:self.mgnify_max_hits]
hhblits_bfd_uniclust_result = self.hhblits_bfd_uniclust_runner.query(
input_fasta_path)
if self._use_small_bfd:
jackhmmer_small_bfd_result = self.jackhmmer_small_bfd_runner.query(
input_fasta_path)[0]
bfd_out_path = os.path.join(msa_output_dir, 'bfd_uniclust_hits.a3m')
with open(bfd_out_path, 'w') as f:
f.write(hhblits_bfd_uniclust_result['a3m'])
bfd_out_path = os.path.join(msa_output_dir, 'small_bfd_hits.a3m')
with open(bfd_out_path, 'w') as f:
f.write(jackhmmer_small_bfd_result['sto'])
bfd_msa, bfd_deletion_matrix = parsers.parse_a3m(
hhblits_bfd_uniclust_result['a3m'])
bfd_msa, bfd_deletion_matrix, _ = parsers.parse_stockholm(
jackhmmer_small_bfd_result['sto'])
else:
hhblits_bfd_uniclust_result = self.hhblits_bfd_uniclust_runner.query(
input_fasta_path)
bfd_out_path = os.path.join(msa_output_dir, 'bfd_uniclust_hits.a3m')
with open(bfd_out_path, 'w') as f:
f.write(hhblits_bfd_uniclust_result['a3m'])
bfd_msa, bfd_deletion_matrix = parsers.parse_a3m(
hhblits_bfd_uniclust_result['a3m'])
templates_result = self.template_featurizer.get_templates(
query_sequence=input_sequence,
query_pdb_code=None,
query_release_date=None,
hhr_hits=hhsearch_hits)
hits=hhsearch_hits)
sequence_features = make_sequence_features(
sequence=input_sequence,
......
......@@ -93,19 +93,12 @@ TEMPLATE_FEATURES = {
'template_all_atom_masks': np.float32,
'template_all_atom_positions': np.float32,
'template_domain_names': np.object,
'template_e_value': np.float32,
'template_neff': np.float32,
'template_prob_true': np.float32,
'template_release_date': np.object,
'template_score': np.float32,
'template_similarity': np.float32,
'template_sequence': np.object,
'template_sum_probs': np.float32,
'template_confidence_scores': np.int64
}
def _get_pdb_id_and_chain(hit: parsers.HhrHit) -> Tuple[str, str]:
def _get_pdb_id_and_chain(hit: parsers.TemplateHit) -> Tuple[str, str]:
"""Returns PDB id and chain id for an HHSearch Hit."""
# PDB ID: 4 letters. Chain ID: 1+ alphanumeric letters or "." if unknown.
id_match = re.match(r'[a-zA-Z\d]{4}_[a-zA-Z0-9.]+', hit.name)
......@@ -175,7 +168,7 @@ def _parse_release_dates(path: str) -> Mapping[str, datetime.datetime]:
def _assess_hhsearch_hit(
hit: parsers.HhrHit,
hit: parsers.TemplateHit,
hit_pdb_code: str,
query_sequence: str,
query_pdb_code: Optional[str],
......@@ -487,7 +480,6 @@ def _extract_template_features(
template_sequence: str,
query_sequence: str,
template_chain_id: str,
confidence_scores: str,
kalign_binary_path: str) -> Tuple[Dict[str, Any], Optional[str]]:
"""Parses atom positions in the target structure and aligns with the query.
......@@ -495,21 +487,6 @@ def _extract_template_features(
with their corresponding residue in the query sequence, according to the
alignment mapping provided.
Note that we only extract at most 500 templates because of HHSearch settings.
We set missing/invalid confidence scores to the default value of -1.
Note: We now have 4 types of confidence scores:
1. Valid scores
2. Invalid scores of residues not in both the query sequence and template
sequence
3. Missing scores because we don't have the secondary structure, and HHAlign
doesn't produce the posterior probabilities in this case.
4. Missing scores because of a different template sequence in PDB70,
invalidating the previously computed confidence scores. (Though in theory
HHAlign can be run on these to recompute the correct confidence scores).
We handle invalid and missing scores by setting them to -1, but consider
adding masks for the different types.
Args:
mmcif_object: mmcif_parsing.MmcifObject representing the template.
pdb_id: PDB code for the template.
......@@ -521,11 +498,6 @@ def _extract_template_features(
protein.
template_chain_id: String ID describing which chain in the structure proto
should be used.
confidence_scores: String containing per-residue confidence scores, where
each character represents the *TRUNCATED* posterior probability that the
corresponding template residue is correctly aligned with the query
residue, given the database match is correct (0 corresponds approximately
to 0-10%, 9 to 90-100%).
kalign_binary_path: The path to a kalign executable used for template
realignment.
......@@ -577,8 +549,6 @@ def _extract_template_features(
template_sequence = seqres
# No mapping offset, the query is aligned to the actual sequence.
mapping_offset = 0
# Confidence scores were based on the previous sequence, so they are invalid
confidence_scores = None
try:
# Essentially set to infinity - we don't want to reject templates unless
......@@ -594,7 +564,6 @@ def _extract_template_features(
all_atom_masks = np.split(all_atom_mask, all_atom_mask.shape[0])
output_templates_sequence = []
output_confidence_scores = []
templates_all_atom_positions = []
templates_all_atom_masks = []
......@@ -604,15 +573,12 @@ def _extract_template_features(
np.zeros((residue_constants.atom_type_num, 3)))
templates_all_atom_masks.append(np.zeros(residue_constants.atom_type_num))
output_templates_sequence.append('-')
output_confidence_scores.append(-1)
for k, v in mapping.items():
template_index = v + mapping_offset
templates_all_atom_positions[k] = all_atom_positions[template_index][0]
templates_all_atom_masks[k] = all_atom_masks[template_index][0]
output_templates_sequence[k] = template_sequence[v]
if confidence_scores and confidence_scores[v] != ' ':
output_confidence_scores[k] = int(confidence_scores[v])
# Alanine (AA with the lowest number of atoms) has 5 atoms (C, CA, CB, N, O).
if np.sum(templates_all_atom_masks) < 5:
......@@ -627,13 +593,13 @@ def _extract_template_features(
output_templates_sequence, residue_constants.HHBLITS_AA_TO_ID)
return (
{'template_all_atom_positions': np.array(templates_all_atom_positions),
'template_all_atom_masks': np.array(templates_all_atom_masks),
'template_sequence': output_templates_sequence.encode(),
'template_aatype': np.array(templates_aatype),
'template_confidence_scores': np.array(output_confidence_scores),
'template_domain_names': f'{pdb_id.lower()}_{chain_id}'.encode(),
'template_release_date': mmcif_object.header['release_date'].encode()},
{
'template_all_atom_positions': np.array(templates_all_atom_positions),
'template_all_atom_masks': np.array(templates_all_atom_masks),
'template_sequence': output_templates_sequence.encode(),
'template_aatype': np.array(templates_aatype),
'template_domain_names': f'{pdb_id.lower()}_{chain_id}'.encode(),
},
warning)
......@@ -704,7 +670,7 @@ class SingleHitResult:
def _process_single_hit(
query_sequence: str,
query_pdb_code: Optional[str],
hit: parsers.HhrHit,
hit: parsers.TemplateHit,
mmcif_dir: str,
max_template_date: datetime.datetime,
release_dates: Mapping[str, datetime.datetime],
......@@ -745,9 +711,6 @@ def _process_single_hit(
# The mapping is from the query to the actual hit sequence, so we need to
# remove gaps.
template_sequence = hit.hit_sequence.replace('-', '')
confidence_scores = ''.join(
[cs for t, cs in zip(hit.hit_sequence, hit.confidence_scores)
if t != '-'])
cif_path = os.path.join(mmcif_dir, hit_pdb_code + '.cif')
logging.info('Reading PDB entry from %s. Query: %s, template: %s',
......@@ -779,14 +742,8 @@ def _process_single_hit(
template_sequence=template_sequence,
query_sequence=query_sequence,
template_chain_id=hit_chain_id,
confidence_scores=confidence_scores,
kalign_binary_path=kalign_binary_path)
features['template_e_value'] = [hit.e_value]
features['template_sum_probs'] = [hit.sum_probs]
features['template_prob_true'] = [hit.prob_true]
features['template_score'] = [hit.score]
features['template_neff'] = [hit.neff]
features['template_similarity'] = [hit.similarity]
# It is possible there were some errors when parsing the other chains in the
# mmCIF file, but the template features for the chain we want were still
......@@ -887,7 +844,7 @@ class TemplateHitFeaturizer:
query_sequence: str,
query_pdb_code: Optional[str],
query_release_date: Optional[datetime.datetime],
hhr_hits: Sequence[parsers.HhrHit]) -> TemplateSearchResult:
hits: Sequence[parsers.TemplateHit]) -> TemplateSearchResult:
"""Computes the templates for given query sequence (more details above)."""
logging.info('Searching for template for: %s', query_pdb_code)
......@@ -909,8 +866,8 @@ class TemplateHitFeaturizer:
errors = []
warnings = []
for hit in sorted(hhr_hits, key=lambda x: x.sum_probs, reverse=True):
# We got all the templates we wanted, stop processing HHSearch hits.
for hit in sorted(hits, key=lambda x: x.sum_probs, reverse=True):
# We got all the templates we wanted, stop processing hits.
if num_hits >= self._max_hits:
break
......
# Copyright 2021 DeepMind Technologies Limited
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
"""Python wrappers for third party tools."""
# Copyright 2021 DeepMind Technologies Limited
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
"""A Python wrapper for hmmbuild - construct HMM profiles from MSA."""
import os
import re
import subprocess
from absl import logging
# Internal import (7716).
from alphafold.data.tools import utils
class Hmmbuild(object):
"""Python wrapper of the hmmbuild binary."""
def __init__(self,
*,
binary_path: str,
singlemx: bool = False):
"""Initializes the Python hmmbuild wrapper.
Args:
binary_path: The path to the hmmbuild executable.
singlemx: Whether to use --singlemx flag. If True, it forces HMMBuild to
just use a common substitution score matrix.
Raises:
RuntimeError: If hmmbuild binary not found within the path.
"""
self.binary_path = binary_path
self.singlemx = singlemx
def build_profile_from_sto(self, sto: str, model_construction='fast') -> str:
"""Builds a HHM for the aligned sequences given as an A3M string.
Args:
sto: A string with the aligned sequences in the Stockholm format.
model_construction: Whether to use reference annotation in the msa to
determine consensus columns ('hand') or default ('fast').
Returns:
A string with the profile in the HMM format.
Raises:
RuntimeError: If hmmbuild fails.
"""
return self._build_profile(sto, model_construction=model_construction)
def build_profile_from_a3m(self, a3m: str) -> str:
"""Builds a HHM for the aligned sequences given as an A3M string.
Args:
a3m: A string with the aligned sequences in the A3M format.
Returns:
A string with the profile in the HMM format.
Raises:
RuntimeError: If hmmbuild fails.
"""
lines = []
for line in a3m.splitlines():
if not line.startswith('>'):
line = re.sub('[a-z]+', '', line) # Remove inserted residues.
lines.append(line + '\n')
msa = ''.join(lines)
return self._build_profile(msa, model_construction='fast')
def _build_profile(self, msa: str, model_construction: str = 'fast') -> str:
"""Builds a HMM for the aligned sequences given as an MSA string.
Args:
msa: A string with the aligned sequences, in A3M or STO format.
model_construction: Whether to use reference annotation in the msa to
determine consensus columns ('hand') or default ('fast').
Returns:
A string with the profile in the HMM format.
Raises:
RuntimeError: If hmmbuild fails.
ValueError: If unspecified arguments are provided.
"""
if model_construction not in {'hand', 'fast'}:
raise ValueError(f'Invalid model_construction {model_construction} - only '
'hand and fast supported.')
with utils.tmpdir_manager(base_dir='/tmp') as query_tmp_dir:
input_query = os.path.join(query_tmp_dir, 'query.msa')
output_hmm_path = os.path.join(query_tmp_dir, 'output.hmm')
with open(input_query, 'w') as f:
f.write(msa)
cmd = [self.binary_path]
# If adding flags, we have to do so before the output and input:
if model_construction == 'hand':
cmd.append(f'--{model_construction}')
if self.singlemx:
cmd.append('--singlemx')
cmd.extend([
'--amino',
output_hmm_path,
input_query,
])
logging.info('Launching subprocess %s', cmd)
process = subprocess.Popen(cmd, stdout=subprocess.PIPE,
stderr=subprocess.PIPE)
with utils.timing('hmmbuild query'):
stdout, stderr = process.communicate()
retcode = process.wait()
logging.info('hmmbuild stdout:\n%s\n\nstderr:\n%s\n',
stdout.decode('utf-8'), stderr.decode('utf-8'))
if retcode:
raise RuntimeError('hmmbuild failed\nstdout:\n%s\n\nstderr:\n%s\n'
% (stdout.decode('utf-8'), stderr.decode('utf-8')))
with open(output_hmm_path, encoding='utf-8') as f:
hmm = f.read()
return hmm
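# Example usage (editorial sketch, not part of the module; the binary path is
# an assumption about the local machine):
builder = Hmmbuild(binary_path='/usr/bin/hmmbuild')
toy_a3m = '>query\nMKTAYIAKQR\n>hit\nMKTAYIAKQR\n'
profile = builder.build_profile_from_a3m(toy_a3m)  # HMMER-format profile text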
# Copyright 2021 DeepMind Technologies Limited
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
"""A Python wrapper for hmmsearch - search profile against a sequence db."""
import os
import subprocess
from typing import Optional, Sequence
from absl import logging
# Internal import (7716).
from alphafold.data.tools import utils
class Hmmsearch(object):
"""Python wrapper of the hmmsearch binary."""
def __init__(self,
*,
binary_path: str,
database_path: str,
flags: Optional[Sequence[str]] = None):
"""Initializes the Python hmmsearch wrapper.
Args:
binary_path: The path to the hmmsearch executable.
database_path: The path to the hmmsearch database (FASTA format).
flags: List of flags to be used by hmmsearch.
Raises:
RuntimeError: If hmmsearch binary not found within the path.
"""
self.binary_path = binary_path
self.database_path = database_path
self.flags = flags
if not os.path.exists(self.database_path):
logging.error('Could not find hmmsearch database %s', database_path)
raise ValueError(f'Could not find hmmsearch database {database_path}')
def query(self, hmm: str) -> str:
"""Queries the database using hmmsearch using a given hmm."""
with utils.tmpdir_manager(base_dir='/tmp') as query_tmp_dir:
hmm_input_path = os.path.join(query_tmp_dir, 'query.hmm')
a3m_out_path = os.path.join(query_tmp_dir, 'output.a3m')
with open(hmm_input_path, 'w') as f:
f.write(hmm)
cmd = [
self.binary_path,
'--noali', # Don't include the alignment in stdout.
'--cpu', '8'
]
# If adding flags, we have to do so before the output and input:
if self.flags:
cmd.extend(self.flags)
cmd.extend([
'-A', a3m_out_path,
hmm_input_path,
self.database_path,
])
logging.info('Launching sub-process %s', cmd)
process = subprocess.Popen(
cmd, stdout=subprocess.PIPE, stderr=subprocess.PIPE)
with utils.timing(
f'hmmsearch ({os.path.basename(self.database_path)}) query'):
stdout, stderr = process.communicate()
retcode = process.wait()
if retcode:
raise RuntimeError(
'hmmsearch failed:\nstdout:\n%s\n\nstderr:\n%s\n' % (
stdout.decode('utf-8'), stderr.decode('utf-8')))
with open(a3m_out_path) as f:
a3m_out = f.read()
return a3m_out
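# Example usage (editorial sketch, not part of the module; the binary and
# database paths are assumptions): search a profile produced by Hmmbuild
# against a sequence database and retrieve the alignment of significant hits.
searcher = Hmmsearch(
    binary_path='/usr/bin/hmmsearch',
    database_path='/data/pdb_seqres.fasta')
with open('/tmp/query.hmm') as f:  # e.g. written out from Hmmbuild's output
  hmm_profile = f.read()
hits_alignment = searcher.query(hmm_profile)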
......@@ -14,9 +14,12 @@
"""Library to run Jackhmmer from Python."""
from concurrent import futures
import glob
import os
import subprocess
from typing import Any, Mapping, Optional
from typing import Any, Callable, Mapping, Optional, Sequence
from urllib import request
from absl import logging
......@@ -40,7 +43,9 @@ class Jackhmmer:
filter_f2: float = 0.00005,
filter_f3: float = 0.0000005,
incdom_e: Optional[float] = None,
dom_e: Optional[float] = None):
dom_e: Optional[float] = None,
num_streamed_chunks: Optional[int] = None,
streaming_callback: Optional[Callable[[int], None]] = None):
"""Initializes the Python Jackhmmer wrapper.
Args:
......@@ -57,11 +62,15 @@ class Jackhmmer:
incdom_e: Domain e-value criteria for inclusion of domains in MSA/next
round.
dom_e: Domain e-value criteria for inclusion in tblout.
num_streamed_chunks: Number of database chunks to stream over.
streaming_callback: Callback function run after each chunk iteration with
the iteration number as argument.
"""
self.binary_path = binary_path
self.database_path = database_path
self.num_streamed_chunks = num_streamed_chunks
if not os.path.exists(self.database_path):
if not os.path.exists(self.database_path) and num_streamed_chunks is None:
logging.error('Could not find Jackhmmer database %s', database_path)
raise ValueError(f'Could not find Jackhmmer database {database_path}')
......@@ -75,9 +84,11 @@ class Jackhmmer:
self.incdom_e = incdom_e
self.dom_e = dom_e
self.get_tblout = get_tblout
self.streaming_callback = streaming_callback
def query(self, input_fasta_path: str) -> Mapping[str, Any]:
"""Queries the database using Jackhmmer."""
def _query_chunk(self, input_fasta_path: str, database_path: str
) -> Mapping[str, Any]:
"""Queries the database chunk using Jackhmmer."""
with utils.tmpdir_manager(base_dir='/tmp') as query_tmp_dir:
sto_path = os.path.join(query_tmp_dir, 'output.sto')
......@@ -114,13 +125,13 @@ class Jackhmmer:
cmd_flags.extend(['--incdomE', str(self.incdom_e)])
cmd = [self.binary_path] + cmd_flags + [input_fasta_path,
self.database_path]
database_path]
logging.info('Launching subprocess "%s"', ' '.join(cmd))
process = subprocess.Popen(
cmd, stdout=subprocess.PIPE, stderr=subprocess.PIPE)
with utils.timing(
f'Jackhmmer ({os.path.basename(self.database_path)}) query'):
f'Jackhmmer ({os.path.basename(database_path)}) query'):
_, stderr = process.communicate()
retcode = process.wait()
......@@ -145,3 +156,43 @@ class Jackhmmer:
e_value=self.e_value)
return raw_output
def query(self, input_fasta_path: str) -> Sequence[Mapping[str, Any]]:
"""Queries the database using Jackhmmer."""
if self.num_streamed_chunks is None:
return [self._query_chunk(input_fasta_path, self.database_path)]
db_basename = os.path.basename(self.database_path)
db_remote_chunk = lambda db_idx: f'{self.database_path}.{db_idx}'
db_local_chunk = lambda db_idx: f'/tmp/ramdisk/{db_basename}.{db_idx}'
# Remove existing files to prevent OOM
for f in glob.glob(db_local_chunk('[0-9]*')):
try:
os.remove(f)
except OSError:
print(f'OSError while deleting {f}')
# Download the (i+1)-th chunk while Jackhmmer is running on the i-th chunk
with futures.ThreadPoolExecutor(max_workers=2) as executor:
chunked_output = []
for i in range(1, self.num_streamed_chunks + 1):
# Copy the chunk locally
if i == 1:
future = executor.submit(
request.urlretrieve, db_remote_chunk(i), db_local_chunk(i))
if i < self.num_streamed_chunks:
next_future = executor.submit(
request.urlretrieve, db_remote_chunk(i+1), db_local_chunk(i+1))
# Run Jackhmmer with the chunk
future.result()
chunked_output.append(
self._query_chunk(input_fasta_path, db_local_chunk(i)))
# Remove the local copy of the chunk
os.remove(db_local_chunk(i))
future = next_future
if self.streaming_callback:
self.streaming_callback(i)
return chunked_output
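# Editorial sketch of the streaming mode (not part of the module): when
# `num_streamed_chunks` is set, `database_path` is treated as the base name of
# a remotely hosted, pre-chunked database named <base>.1, <base>.2, ... The
# URL, chunk count, and callback below are assumptions for illustration only.
def report_progress(chunk_index):
  print(f'Finished chunk {chunk_index}')

runner = Jackhmmer(
    binary_path='/usr/bin/jackhmmer',
    database_path='https://example.com/uniref90_2021_03.fasta',
    num_streamed_chunks=59,
    streaming_callback=report_progress)
results = runner.query('/tmp/query.fasta')  # one result dict per chunk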
......@@ -492,7 +492,7 @@ class StructureModule(hk.Module):
is_training=is_training,
safe_key=safe_key)
representations['structure_module'] = output['act']
ret['representations'] = {'structure_module': output['act']}
ret['traj'] = output['affine'] * jnp.array([1.] * 4 +
[c.position_scale] * 3)
......@@ -514,7 +514,8 @@ class StructureModule(hk.Module):
if self.compute_loss:
return ret
else:
no_loss_features = ['final_atom_positions', 'final_atom_mask']
no_loss_features = ['final_atom_positions', 'final_atom_mask',
'representations']
no_loss_ret = {k: ret[k] for k in no_loss_features}
return no_loss_ret
......
......@@ -237,6 +237,10 @@ class AlphaFoldIteration(hk.Module):
continue
else:
ret[name] = module(representations, batch, is_training)
if 'representations' in ret[name]:
# Extra representations from the head. Used by the structure module
# to provide activations for the PredictedLDDTHead.
representations.update(ret[name].pop('representations'))
if compute_loss:
total_loss += loss(module, head_config, ret, name)
......
# Copyright 2021 DeepMind Technologies Limited
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
"""Alphafold model TensorFlow code."""
......@@ -146,13 +146,13 @@ def process_tensors_from_config(tensors, data_config):
num_ensemble *= data_config.common.num_recycle + 1
if isinstance(num_ensemble, tf.Tensor) or num_ensemble > 1:
dtype = tree.map_structure(lambda x: x.dtype,
tensors_0)
fn_output_signature = tree.map_structure(
tf.TensorSpec.from_tensor, tensors_0)
tensors = tf.map_fn(
lambda x: wrap_ensemble_fn(tensors, x),
tf.range(num_ensemble),
parallel_iterations=1,
dtype=dtype)
fn_output_signature=fn_output_signature)
else:
tensors = tree.map_structure(lambda x: x[None],
tensors_0)
......
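# Editorial sketch (not part of this file) of the pattern used above: in TF2,
# tf.map_fn describes structured outputs with `fn_output_signature` instead of
# the deprecated `dtype` argument.
import tensorflow as tf

out = tf.map_fn(
    lambda i: {'x': tf.cast(i, tf.float32), 'y': tf.stack([i, i])},
    tf.range(3),
    fn_output_signature={'x': tf.TensorSpec([], tf.float32),
                         'y': tf.TensorSpec([2], tf.int32)})
# out['x'].shape == (3,), out['y'].shape == (3, 2)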
......@@ -52,7 +52,7 @@ def _add_restraints(
stiffness: unit.Unit,
rset: str,
exclude_residues: Sequence[int]):
"""Adds a harmonic potential that restrains the end-to-end distance."""
"""Adds a harmonic potential that restrains the system to a structure."""
assert rset in ["non_hydrogen", "c_alpha"]
force = openmm.CustomExternalForce(
......
......@@ -54,7 +54,6 @@ class AmberMinimizeTest(absltest.TestCase):
max_attempts=1)
def test_iterative_relax(self):
# This test can occasionally fail because of nondeterminism in OpenMM.
prot = _load_test_protein(
'alphafold/relax/testdata/with_violations.pdb'
)
......
......@@ -48,7 +48,7 @@ def overwrite_b_factors(pdb_str: str, bfactors: np.ndarray) -> str:
raise ValueError(
f'Invalid final dimension size for bfactors: {bfactors.shape[-1]}.')
parser = PDB.PDBParser()
parser = PDB.PDBParser(QUIET=True)
handle = io.StringIO(pdb_str)
structure = parser.get_structure('', handle)
......
......@@ -54,7 +54,8 @@ RUN conda update -qy conda \
openmm=7.5.1 \
cudatoolkit==${CUDA}.3 \
pdbfixer \
pip
pip \
python=3.7
COPY . /app/alphafold
RUN wget -q -P /app/alphafold/alphafold/common/ \
......@@ -67,7 +68,7 @@ RUN pip3 install --upgrade pip \
https://storage.googleapis.com/jax-releases/jax_releases.html
# Apply OpenMM patch.
WORKDIR /opt/conda/lib/python3.8/site-packages
WORKDIR /opt/conda/lib/python3.7/site-packages
RUN patch -p0 < /app/alphafold/docker/openmm.patch
# We need to run `ldconfig` first to ensure GPUs are visible, due to some quirk
......
......@@ -57,13 +57,17 @@ uniref90_database_path = os.path.join(
# Path to the MGnify database for use by JackHMMER.
mgnify_database_path = os.path.join(
DOWNLOAD_DIR, 'mgnify', 'mgy_clusters.fa')
DOWNLOAD_DIR, 'mgnify', 'mgy_clusters_2018_08.fa')
# Path to the BFD database for use by HHblits.
bfd_database_path = os.path.join(
DOWNLOAD_DIR, 'bfd',
'bfd_metaclust_clu_complete_id30_c90_final_seq.sorted_opt')
# Path to the Small BFD database for use by JackHMMER.
small_bfd_database_path = os.path.join(
DOWNLOAD_DIR, 'small_bfd', 'bfd-first_non_consensus_sequences.fasta')
# Path to the Uniclust30 database for use by HHblits.
uniclust30_database_path = os.path.join(
DOWNLOAD_DIR, 'uniclust30', 'uniclust30_2018_08', 'uniclust30_2018_08')
......@@ -92,10 +96,11 @@ flags.DEFINE_string('max_template_date', None, 'Maximum template release date '
'to consider (ISO-8601 format - i.e. YYYY-MM-DD). '
'Important if folding historical test sets.')
flags.DEFINE_enum('preset', 'full_dbs',
['full_dbs', 'casp14'],
'Choose preset model configuration - no ensembling with '
'uniref90 + bfd + uniclust30 (full_dbs), or '
'8 model ensemblings with uniref90 + bfd + uniclust30 '
['reduced_dbs', 'full_dbs', 'casp14'],
'Choose preset model configuration - no ensembling and '
'smaller genetic database config (reduced_dbs), no '
'ensembling and full genetic database config (full_dbs) or '
'full genetic database config and 8 model ensemblings '
'(casp14).')
flags.DEFINE_boolean('benchmark', False, 'Run multiple JAX model evaluations '
'to obtain a timing that excludes the compilation time, '
......@@ -131,14 +136,22 @@ def main(argv):
target_fasta_paths.append(target_path)
command_args.append(f'--fasta_paths={",".join(target_fasta_paths)}')
for name, path in [('uniref90_database_path', uniref90_database_path),
('mgnify_database_path', mgnify_database_path),
('uniclust30_database_path', uniclust30_database_path),
('bfd_database_path', bfd_database_path),
('pdb70_database_path', pdb70_database_path),
('data_dir', data_dir),
('template_mmcif_dir', template_mmcif_dir),
('obsolete_pdbs_path', obsolete_pdbs_path)]:
database_paths = [
('uniref90_database_path', uniref90_database_path),
('mgnify_database_path', mgnify_database_path),
('pdb70_database_path', pdb70_database_path),
('data_dir', data_dir),
('template_mmcif_dir', template_mmcif_dir),
('obsolete_pdbs_path', obsolete_pdbs_path),
]
if FLAGS.preset == 'reduced_dbs':
database_paths.append(('small_bfd_database_path', small_bfd_database_path))
else:
database_paths.extend([
('uniclust30_database_path', uniclust30_database_path),
('bfd_database_path', bfd_database_path),
])
for name, path in database_paths:
if path:
mount, target_path = _create_mount(name, path)
mounts.append(mount)
......