
<!--
 * @Author: zhuww
 * @email: zhuww@sugon.com
 * @Date: 2023-03-31 17:09:07
 * @LastEditTime: 2023-04-01 09:53:01
 * @FilePath: \FastFold\README.md
-->
# FastFold
## Model overview
FastFold builds on a protein structure prediction model and optimizes its inference performance.
## Model architecture
The model is based on the Transformer architecture and consists of two main modules: the Evoformer (48 blocks) and the Structure module (8 blocks).
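## Datasets
We recommend the open-source datasets used by AlphaFold2, including BFD, MGnify, PDB70, Uniclust, and UniRef90. Together they take up about 3 TB.

The script `download_all_data.sh` downloads the required datasets and model files:

    ./scripts/download_all_data.sh <dataset_download_dir>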
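
Because the datasets need about 3 TB, it can help to check free disk space before starting the download. The following is a minimal sketch, not part of the repository: the function names `free_gib` and `has_enough_space` and the 3072 GiB threshold are assumptions chosen for illustration.

```shell
# Sketch only: free_gib and has_enough_space are hypothetical helpers.

# Print the free space, in GiB, of the filesystem that holds directory $1.
free_gib() {
    # df -Pk prints POSIX-format output in 1024-byte blocks;
    # column 4 of the second line is the available space.
    df -Pk "$1" | awk 'NR == 2 { printf "%d\n", $4 / 1024 / 1024 }'
}

# Succeed only if directory $1 has at least $2 GiB free.
has_enough_space() {
    local dir="$1" required_gib="$2"
    [ "$(free_gib "$dir")" -ge "$required_gib" ]
}

# Example usage (commented out so that nothing is downloaded by accident):
# DOWNLOAD_DIR=/path/to/datasets
# if has_enough_space "$DOWNLOAD_DIR" 3072; then
#     ./scripts/download_all_data.sh "$DOWNLOAD_DIR"
# else
#     echo "Less than 3072 GiB free in $DOWNLOAD_DIR" >&2
# fi
```

The threshold is approximate; adjust it if only a subset of the databases is downloaded.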

## Inference
### Environment setup
The inference Docker image can be pulled from [SourceFind (光源)](https://www.sourcefind.cn/#/service-details):
* Inference image: `docker pull image.sourcefind.cn:5000/dcu/admin/base/custom:fastfold-0.2.1-centos7.6-dtk-22.10-patch4-py38-latest`

Activate the environment inside the container:
`source /opt/dtk-22.10/env.sh`
`source /opt/openmm-dtk-22.10/env.sh`

Test directory:
`/opt/docker/test`

### Running inference
We provide PyTorch-based inference scripts for both monomers and multimers. Version requirement:
* PyTorch (DCU build) >= 1.10.0a0
#### Monomer

    python inference.py T1024.fasta data/pdb_mmcif/mmcif_files/ \
    --output_dir ./ \
    --gpus 2 \
    --use_precomputed_alignments alignments/ \
    --param_path /data/params/params_model_1.npz  \
    --uniref90_database_path data/uniref90/uniref90.fasta \
    --mgnify_database_path data/mgnify/mgy_clusters_2018_12.fa \
    --pdb70_database_path data/pdb70/pdb70 \
    --uniclust30_database_path data/uniclust30/uniclust30_2018_08/uniclust30_2018_08 \
    --bfd_database_path data/bfd/bfd_metaclust_clu_complete_id30_c90_final_seq.sorted_opt \
    --jackhmmer_binary_path `which jackhmmer` \
    --hhblits_binary_path `which hhblits` \
    --hhsearch_binary_path `which hhsearch` \
    --kalign_binary_path `which kalign` \
    --chunk_size 4 \
    --inplace

Alternatively, run `./inference.sh`.

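##### Monomer inference parameters
* `T1024.fasta` is the monomer sequence to run inference on; replace `data` with the dataset download directory.
* `--output_dir` is the output directory; `--gpus` is the number of GPUs to use.
* `--use_precomputed_alignments` is the alignment directory, from which previously computed alignments are loaded; if the flag is omitted, the alignment search is run.
* `--param_path` is the path to the monomer model parameters and must be consistent with `--model_name`, which defaults to `model_1`.
* `--chunk_size` controls chunking and is set to 4 here; it is used together with `--inplace` to reduce GPU memory usage.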
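
The parameter notes above can be sketched as a small wrapper that derives every database path from one dataset directory. This is an illustration only: `run_monomer` and the `DRY_RUN` variable are hypothetical names, `command -v` stands in for the `which` calls of the example command, and the `--param_path` value is copied unchanged from that example.

```shell
# Sketch only: run_monomer and DRY_RUN are hypothetical, not part of FastFold.
run_monomer() {
    local data_dir="$1"   # dataset download directory
    local fasta="$2"      # monomer sequence, e.g. T1024.fasta

    # Replace the function's positional parameters with the full command line.
    set -- python inference.py "$fasta" "$data_dir/pdb_mmcif/mmcif_files/" \
        --output_dir ./ \
        --gpus 2 \
        --param_path /data/params/params_model_1.npz \
        --uniref90_database_path "$data_dir/uniref90/uniref90.fasta" \
        --mgnify_database_path "$data_dir/mgnify/mgy_clusters_2018_12.fa" \
        --pdb70_database_path "$data_dir/pdb70/pdb70" \
        --uniclust30_database_path "$data_dir/uniclust30/uniclust30_2018_08/uniclust30_2018_08" \
        --bfd_database_path "$data_dir/bfd/bfd_metaclust_clu_complete_id30_c90_final_seq.sorted_opt" \
        --jackhmmer_binary_path "$(command -v jackhmmer)" \
        --hhblits_binary_path "$(command -v hhblits)" \
        --hhsearch_binary_path "$(command -v hhsearch)" \
        --kalign_binary_path "$(command -v kalign)" \
        --chunk_size 4 \
        --inplace

    # Load precomputed alignments only if the directory exists; without the
    # flag, the alignment search is run as part of inference.
    if [ -d alignments ]; then
        set -- "$@" --use_precomputed_alignments alignments/
    fi

    if [ "${DRY_RUN:-1}" = "1" ]; then
        printf '%s\n' "$*"   # default: print the command instead of running it
    else
        "$@"
    fi
}

# Example usage:
# run_monomer /path/to/datasets T1024.fasta             # prints the command
# DRY_RUN=0 run_monomer /path/to/datasets T1024.fasta   # runs it
```

Printing the command by default makes it easy to check the resolved paths before committing to a long run.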


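AlphaFold's data preprocessing is very time-consuming, so we speed up the preprocessing workflow with [Ray](https://docs.ray.io/en/latest/workflows/concepts.html).
To run inference with the Ray workflow, add the `--enable_workflow` argument to the command line or to the `./inference.sh` script.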
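
One way to make the Ray workflow optional is to append the flag only when an environment variable is set. The sketch below assumes a hypothetical helper `maybe_workflow` and a hypothetical variable `USE_RAY_WORKFLOW`; only `--enable_workflow` itself comes from this README.

```shell
# Sketch only: maybe_workflow and USE_RAY_WORKFLOW are hypothetical names.
# Run the given command, appending --enable_workflow when USE_RAY_WORKFLOW=1.
maybe_workflow() {
    if [ "${USE_RAY_WORKFLOW:-0}" = "1" ]; then
        "$@" --enable_workflow
    else
        "$@"
    fi
}

# Example usage (remaining flags as in the monomer command above):
# USE_RAY_WORKFLOW=1 maybe_workflow python inference.py T1024.fasta data/pdb_mmcif/mmcif_files/ --output_dir ./
```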

#### Multimer
    python inference.py SUGP1.fasta data/pdb_mmcif/mmcif_files/ \
    --output_dir ./ \
    --gpus 2 \
    --use_precomputed_alignments alignments/ \
    --model_preset multimer \
    --uniref90_database_path data/uniref90/uniref90.fasta \
    --mgnify_database_path data/mgnify/mgy_clusters_2018_12.fa \
    --pdb70_database_path data/pdb70/pdb70 \
    --uniclust30_database_path data/uniclust30/uniclust30_2018_08/uniclust30_2018_08 \
    --bfd_database_path data/bfd/bfd_metaclust_clu_complete_id30_c90_final_seq.sorted_opt \
    --uniprot_database_path data/uniprot/uniprot_sprot.fasta \
    --pdb_seqres_database_path data/pdb_seqres/pdb_seqres.txt  \
    --param_path data/params/params_model_1_multimer.npz \
    --model_name model_1_multimer \
    --jackhmmer_binary_path `which jackhmmer` \
    --hhblits_binary_path `which hhblits` \
    --hhsearch_binary_path `which hhsearch` \
    --kalign_binary_path `which kalign` \
    --chunk_size 4 \
    --inplace 

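Alternatively, run `./inference_multimer.sh`.

##### Multimer inference parameters
`SUGP1.fasta` is the multimer sequence to run inference on. `--param_path` is the path to the multimer model parameters and must be consistent with `--model_name`. All other parameters are the same as for monomer inference.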
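
The consistency requirement between `--param_path` and `--model_name` can be checked before a long run. The sketch below assumes the file naming rule `params_<model_name>.npz`, which is inferred from the two example commands in this README and is not a documented FastFold contract; `param_matches_model` is a hypothetical helper.

```shell
# Sketch only: param_matches_model is a hypothetical helper. It assumes the
# parameter file is named params_<model_name>.npz, as in the examples above.
param_matches_model() {
    local param_path="$1" model_name="$2"
    [ "$(basename "$param_path")" = "params_${model_name}.npz" ]
}

# Example usage:
# if ! param_matches_model data/params/params_model_1_multimer.npz model_1_multimer; then
#     echo "--param_path and --model_name do not match" >&2
# fi
```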


## Performance and accuracy
Test data: [CASP14](https://www.predictioncenter.org/casp14/targetlist.cgi) and [UniProt](https://www.uniprot.org/). Accelerator: first-generation DCU, 16 GB.

Performance:

| Cards | Data type | Sequence type | Sequence ID | Sequence length | Time (s) |
| :------: | :------: | :------: | :------: |:------: |:------: |
| 4 | fp32 | Monomer | T1024  | 408  | 80   |
| 4 | fp32 | Monomer | Q9NYK1 | 1046 | 488  |
| 4 | fp32 | Monomer | Q8CHI8 | 3072 | 8718 |
| 4 | fp32 | Multimer | H1036  | 856  | 324  |
| 4 | fp32 | Multimer | H1060  | 1106 | 449  |
| 4 | fp32 | Multimer | 6h3c   | 2030 | 2142 |

Accuracy:

| Cards | Data type | Sequence type | Sequence ID | Sequence length | GDT-TS | GDT-HA | LDDT | TM score | MaxSub | RMSE |
| :------: | :------: | :------: | :------: |:------: |:------: | :------: | :------: | :------: |:------: |:------: |
| 4 | fp32 | Monomer | T1026  | 172  | 91.4 | 76.5 | 79.6 | 94.1 | 90.7 | 1.3 |
| 4 | fp32 | Monomer | T1053  | 580  | 93.7 | 78.2 | 92.3 | 98.4 | 92.9 | 1.1 |
| 4 | fp32 | Monomer | Q9NYK1 | 1046 | 90.1 | 74.4 | 86.6 | 96.2 | 90.5 | 5.8 |

## Version history
* https://developer.hpccube.com/codes/modelzoo/FastFold
## References
* [https://github.com/deepmind/alphafold](https://github.com/deepmind/alphafold)
* [https://github.com/hpcaitech/FastFold](https://github.com/hpcaitech/FastFold)