<!--
 * @Author: zhuww
 * @email: zhuww@sugon.com
 * @Date: 2023-03-31 17:09:07
 * @LastEditTime: 2023-04-25 14:07:01
-->
# FastFold
## Model Introduction
FastFold builds on a protein structure prediction model and optimizes its inference performance.
## Model Architecture
The model is based on the Transformer architecture and consists of two main modules: Evoformer (48 blocks) and the Structure module (8 blocks).
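## Dataset
We recommend the open-source datasets used by AlphaFold2, including BFD, MGnify, PDB70, Uniclust, and UniRef90. The datasets total about 3 TB.

We provide a script, download_all_data.sh, that downloads the datasets and model files:

    ./scripts/download_all_data.sh <dataset download directory>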
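
Because the download is large, it can help to confirm that the databases are in place before starting inference. The following is a minimal sketch, not part of the repository: the helper name `missing_databases` is hypothetical, and the relative paths are simply the ones that appear in the inference commands in this README. Entries ending in `*` are treated as database prefixes shared by several files.

```python
import glob
import os

# Paths relative to the download directory, copied from the inference
# commands in this README. Entries ending in "*" are database prefixes
# (several files share the prefix), so they are matched with a glob.
EXPECTED_PATHS = [
    "pdb_mmcif/mmcif_files",
    "uniref90/uniref90.fasta",
    "mgnify/mgy_clusters_2018_12.fa",
    "pdb70/pdb70*",
    "uniclust30/uniclust30_2018_08/uniclust30_2018_08*",
    "bfd/bfd_metaclust_clu_complete_id30_c90_final_seq.sorted_opt*",
]


def missing_databases(data_dir):
    """Return the expected entries that are not present under data_dir.

    Hypothetical helper: it only checks that something exists at each
    path or prefix, not that the download is complete or uncorrupted.
    """
    missing = []
    for rel in EXPECTED_PATHS:
        pattern = os.path.join(data_dir, rel)
        # glob.glob returns an empty list when nothing matches the pattern.
        if not glob.glob(pattern):
            missing.append(rel)
    return missing
```

Calling `missing_databases("data")` returns an empty list when every entry is found, and otherwise the list of entries that still need to be downloaded.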

## Inference
### Environment Setup
The inference Docker image can be pulled from [光源 (SourceFind)](https://www.sourcefind.cn/#/service-details):
* Inference image: docker pull image.sourcefind.cn:5000/dcu/admin/base/custom:fastfold-0.2.1-centos7.6-dtk-22.10-patch4-py38-latest

Activate the environment inside the container:
`source /opt/dtk-22.10/env.sh`
`source /opt/openmm-dtk-22.10/env.sh`

Test directory:
`/opt/docker/test`

### Running Inference
We provide PyTorch-based inference scripts for both monomers and multimers. Version requirements:
* PyTorch (DCU build) >= 1.10.0a0
#### Monomer

    python inference.py T1024.fasta data/pdb_mmcif/mmcif_files/ \
    --output_dir ./ \
    --gpus 2 \
    --use_precomputed_alignments alignments/ \
    --param_path /data/params/params_model_1.npz  \
    --uniref90_database_path data/uniref90/uniref90.fasta \
    --mgnify_database_path data/mgnify/mgy_clusters_2018_12.fa \
    --pdb70_database_path data/pdb70/pdb70 \
    --uniclust30_database_path data/uniclust30/uniclust30_2018_08/uniclust30_2018_08 \
    --bfd_database_path data/bfd/bfd_metaclust_clu_complete_id30_c90_final_seq.sorted_opt \
    --jackhmmer_binary_path `which jackhmmer` \
    --hhblits_binary_path `which hhblits` \
    --hhsearch_binary_path `which hhsearch` \
    --kalign_binary_path `which kalign` \
    --chunk_size 4 \
    --inplace

Alternatively, use `./inference.sh`.

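##### Monomer Inference Parameters
* `T1024.fasta` is the monomer sequence to run inference on; replace `data` with the dataset download directory.
* `--output_dir` is the output directory.
* `--gpus` is the number of GPUs to use.
* `--use_precomputed_alignments` is the alignment directory, from which previously computed alignments are loaded. If it is omitted, the alignment search is run.
* `--param_path` is the path to the monomer model parameters. It must be consistent with `--model_name`, which defaults to model_1.
* `--chunk_size` is the number of chunks. Set it to 4 and add `--inplace` to reduce GPU memory usage.
* Relaxation is performed by default; add `--relaxation` if it is not needed.
* The output .pkl files are not saved by default; add `--save_outputs` if they are needed.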
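
To make the relationship between these options concrete, here is a minimal sketch that assembles the monomer command as an argument list. The function `build_monomer_command` and its keyword names are hypothetical and not part of the repository; the flags themselves are only those documented in this README. Falling back to the bare tool name when a binary is not on `PATH` is a choice made for this sketch.

```python
import shutil


def build_monomer_command(fasta, data_dir, param_path, output_dir="./",
                          gpus=2, alignments_dir=None, chunk_size=4,
                          inplace=True, skip_relaxation=False,
                          save_outputs=False):
    """Assemble the argument list for `python inference.py` (monomer).

    Hypothetical helper: it only arranges the flags shown in this README.
    """
    cmd = [
        "python", "inference.py", fasta, f"{data_dir}/pdb_mmcif/mmcif_files/",
        "--output_dir", output_dir,
        "--gpus", str(gpus),
        "--param_path", param_path,
        "--uniref90_database_path", f"{data_dir}/uniref90/uniref90.fasta",
        "--mgnify_database_path", f"{data_dir}/mgnify/mgy_clusters_2018_12.fa",
        "--pdb70_database_path", f"{data_dir}/pdb70/pdb70",
        "--uniclust30_database_path",
        f"{data_dir}/uniclust30/uniclust30_2018_08/uniclust30_2018_08",
        "--bfd_database_path",
        f"{data_dir}/bfd/bfd_metaclust_clu_complete_id30_c90_final_seq.sorted_opt",
    ]
    # Mirror the `which <tool>` substitutions from the shell command.
    for tool in ("jackhmmer", "hhblits", "hhsearch", "kalign"):
        path = shutil.which(tool) or tool  # sketch choice: bare name if not on PATH
        cmd += [f"--{tool}_binary_path", path]
    # Without this flag the alignment search is run instead.
    if alignments_dir is not None:
        cmd += ["--use_precomputed_alignments", alignments_dir]
    cmd += ["--chunk_size", str(chunk_size)]
    if inplace:
        cmd.append("--inplace")
    # Per this README, passing --relaxation turns the default relax step off.
    if skip_relaxation:
        cmd.append("--relaxation")
    if save_outputs:
        cmd.append("--save_outputs")
    return cmd
```

The resulting list can be passed to `subprocess.run(cmd, check=True)` from the directory that contains `inference.py`.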


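AlphaFold's data preprocessing is very time-consuming, so we use [ray](https://docs.ray.io/en/latest/workflows/concepts.html) to speed up the data preprocessing workflow.
To run inference with the ray workflow, add the `--enable_workflow` argument to the command line or to the `./inference.sh` script.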

#### Multimer
    python inference.py SUGP1.fasta data/pdb_mmcif/mmcif_files/ \
    --output_dir ./ \
    --gpus 2 \
    --use_precomputed_alignments alignments/ \
    --model_preset multimer \
    --uniref90_database_path data/uniref90/uniref90.fasta \
    --mgnify_database_path data/mgnify/mgy_clusters_2018_12.fa \
    --pdb70_database_path data/pdb70/pdb70 \
    --uniclust30_database_path data/uniclust30/uniclust30_2018_08/uniclust30_2018_08 \
    --bfd_database_path data/bfd/bfd_metaclust_clu_complete_id30_c90_final_seq.sorted_opt \
    --uniprot_database_path data/uniprot/uniprot_sprot.fasta \
    --pdb_seqres_database_path data/pdb_seqres/pdb_seqres.txt  \
    --param_path data/params/params_model_1_multimer.npz \
    --model_name model_1_multimer \
    --jackhmmer_binary_path `which jackhmmer` \
    --hhblits_binary_path `which hhblits` \
    --hhsearch_binary_path `which hhsearch` \
    --kalign_binary_path `which kalign` \
    --chunk_size 4 \
    --inplace 

Alternatively, use `./inference_multimer.sh`.

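##### Multimer Inference Parameters
`SUGP1.fasta` is the multimer sequence to run inference on. `--param_path` is the path to the multimer model parameters and must be consistent with `--model_name`. All other parameters are the same as described for monomer inference.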
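
A mismatch between `--param_path` and `--model_name` is easy to introduce when switching between monomer and multimer runs. The sketch below checks the pairing before launching a job. It assumes the naming pattern `params_<model_name>.npz`, which is inferred from the two example commands in this README (`params_model_1.npz` with model_1, `params_model_1_multimer.npz` with model_1_multimer); the helper names are hypothetical.

```python
import os


def expected_param_file(model_name):
    """File name assumed for a model name (pattern inferred from the examples)."""
    return f"params_{model_name}.npz"


def param_matches_model(param_path, model_name="model_1"):
    """Return True when the parameter file name agrees with the model name.

    model_name defaults to model_1, the default stated in this README.
    """
    return os.path.basename(param_path) == expected_param_file(model_name)
```

For example, `param_matches_model("data/params/params_model_1_multimer.npz", "model_1_multimer")` is true, while pairing the monomer file `params_model_1.npz` with `model_1_multimer` is reported as a mismatch.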

## Accuracy Data
Test data: [casp14](https://www.predictioncenter.org/casp14/targetlist.cgi) and [uniprot](https://www.uniprot.org/). Accelerators used: 4 first-generation DCU cards (16 GB).

Accuracy results:

| Data type | Sequence type | Sequence label | Sequence length | GDT-TS | GDT-HA | LDDT | TM score | MaxSub | RMSD |
| :------: | :------: | :------: | :------: |:------: |:------: | :------: | :------: | :------: |:------: |
| fp32 | Monomer | T1026  | 172  | 0.914 | 0.765 | 79.634 | 0.941 | 0.907 | 1.289 |
| fp32 | Monomer | T1053  | 580  | 0.937 | 0.782 | 92.284 | 0.984 | 0.929 | 1.105 |
| fp32 | Monomer | Q9NYK1 | 1046 | 0.907 | 0.744 | 86.642 | 0.962 | 0.905 | 5.757 |
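
For readers unfamiliar with the columns, the sketch below illustrates how GDT-TS, GDT-HA, and RMSD relate to per-residue deviations. It is a simplified illustration, not the evaluation tool used for the table above, which this README does not specify. It assumes the predicted and reference structures are already superimposed and that `distances` holds per-residue Cα deviations in Å; real GDT implementations search over many superpositions rather than using a single one.

```python
import math

# Standard distance cutoffs in angstroms for the two GDT variants.
GDT_TS_CUTOFFS = (1.0, 2.0, 4.0, 8.0)
GDT_HA_CUTOFFS = (0.5, 1.0, 2.0, 4.0)


def fraction_within(distances, cutoff):
    """Fraction of residues whose deviation is at most `cutoff`."""
    return sum(1 for d in distances if d <= cutoff) / len(distances)


def gdt(distances, cutoffs):
    """Average, over the cutoffs, of the fraction of residues within each one."""
    return sum(fraction_within(distances, c) for c in cutoffs) / len(cutoffs)


def rmsd(distances):
    """Root-mean-square of the per-residue deviations."""
    return math.sqrt(sum(d * d for d in distances) / len(distances))


# Illustrative deviations only, not taken from any real prediction.
example = [0.4, 0.9, 1.5, 3.0, 9.0]
gdt_ts = gdt(example, GDT_TS_CUTOFFS)
gdt_ha = gdt(example, GDT_HA_CUTOFFS)
```

Because RMSD squares each deviation, a few poorly placed residues can raise it sharply while the GDT scores, which only count residues within fixed cutoffs, change little.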

## Source Repository and Issue Feedback
* https://developer.hpccube.com/codes/modelzoo/FastFold
## References
* [https://github.com/deepmind/alphafold](https://github.com/deepmind/alphafold)
* [https://github.com/hpcaitech/FastFold](https://github.com/hpcaitech/FastFold)